Paper Title

Random initialisations performing above chance and how to find them

Authors

Frederik Benzing, Simon Schug, Robert Meier, Johannes von Oswald, Yassir Akram, Nicolas Zucchet, Laurence Aitchison, Angelika Steger

Abstract

Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions. Entezari et al. recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after taking into account the permutation invariance of neural networks. Concretely, they hypothesise that any two solutions found by SGD can be permuted such that the linear interpolation between their parameters forms a path without significant increases in loss. Here, we use a simple but powerful algorithm to find such permutations that allows us to obtain direct empirical evidence that the hypothesis is true in fully connected networks. Strikingly, we find that two networks already live in the same loss valley at the time of initialisation and averaging their random, but suitably permuted initialisation performs significantly above chance. In contrast, for convolutional architectures, our evidence suggests that the hypothesis does not hold. Especially in a large learning rate regime, SGD seems to discover diverse modes.
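The core operation behind the hypothesis, permuting the hidden units of one network so that linearly interpolating (or averaging) its parameters with another network's does not raise the loss, can be illustrated with a small sketch. The code below is not the paper's algorithm: it is a minimal NumPy illustration, assuming a one-hidden-layer fully connected network, that matches hidden units via a Hungarian-assignment heuristic on weight vectors and then interpolates the two parameter sets. The function names, shapes, and matching cost are all hypothetical stand-ins.

```python
# Hedged sketch (not the paper's exact method): align the hidden units of two
# one-hidden-layer MLPs by a permutation, then linearly interpolate their parameters.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_hidden_units(W1_a, b1_a, W2_a, W1_b, b1_b, W2_b):
    """Permute the hidden units of network B to best match network A.

    Assumed shapes: W1 is (hidden, in), b1 is (hidden,), W2 is (out, hidden).
    """
    # Describe each hidden unit by its incoming weights, bias and outgoing weights.
    feats_a = np.concatenate([W1_a, b1_a[:, None], W2_a.T], axis=1)
    feats_b = np.concatenate([W1_b, b1_b[:, None], W2_b.T], axis=1)
    # Cost of assigning unit j of B to unit i of A: negative similarity (dot product).
    cost = -feats_a @ feats_b.T
    _, perm = linear_sum_assignment(cost)  # perm[i] = unit of B matched to unit i of A
    # Applying the permutation to rows of W1, b1 and columns of W2 leaves B's function unchanged.
    return W1_b[perm], b1_b[perm], W2_b[:, perm]

def interpolate(params_a, params_b, alpha):
    """Linear interpolation (1 - alpha) * theta_a + alpha * theta_b, tensor by tensor."""
    return [(1.0 - alpha) * pa + alpha * pb for pa, pb in zip(params_a, params_b)]

# Random parameters standing in for two independently initialised (or trained) networks.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 784, 512, 10
net_a = [rng.standard_normal((n_hidden, n_in)), rng.standard_normal(n_hidden),
         rng.standard_normal((n_out, n_hidden))]
net_b = [rng.standard_normal((n_hidden, n_in)), rng.standard_normal(n_hidden),
         rng.standard_normal((n_out, n_hidden))]

net_b_aligned = list(align_hidden_units(*net_a, *net_b))
midpoint = interpolate(net_a, net_b_aligned, alpha=0.5)
# Evaluating the loss of `midpoint` (and of points along alpha in [0, 1]) measures the
# barrier between the two solutions; the hypothesis predicts no significant increase.
```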
