Paper Title
Loss landscapes and optimization in over-parameterized non-linear systems and neural networks
Paper Authors
Paper Abstract
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. The purpose of this work is to propose a modern view and a general mathematical framework for loss landscapes and efficient optimization in over-parameterized machine learning models and systems of non-linear equations, a setting that includes over-parameterized deep neural networks. Our starting observation is that optimization problems corresponding to such systems are generally not convex, even locally. We argue that instead they satisfy PL$^*$, a variant of the Polyak-Lojasiewicz condition on most (but not all) of the parameter space, which guarantees both the existence of solutions and efficient optimization by (stochastic) gradient descent (SGD/GD). The PL$^*$ condition of these systems is closely related to the condition number of the tangent kernel associated to a non-linear system, showing how a PL$^*$-based non-linear theory parallels classical analyses of over-parameterized linear equations. We show that wide neural networks satisfy the PL$^*$ condition, which explains the (S)GD convergence to a global minimum. Finally, we propose a relaxation of the PL$^*$ condition applicable to "almost" over-parameterized systems.
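For reference, the LaTeX sketch below records the standard form of the PL$^*$ inequality for the square loss, its link to the tangent kernel, and the resulting linear convergence of gradient descent. The notation ($\mathcal{F}$, $K$, $\mu$, $\beta$, the set $S$) follows common usage in this line of work and may differ in detail from the paper's own definitions; it is an illustrative summary, not a restatement of the paper's theorems.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Square loss for a model map F: R^m -> R^n fitting targets y
\[
  \mathcal{L}(\mathbf{w}) \;=\; \tfrac{1}{2}\,\bigl\|\mathcal{F}(\mathbf{w}) - \mathbf{y}\bigr\|^{2}.
\]
% mu-PL* condition on a subset S of parameter space (no convexity assumed)
\[
  \tfrac{1}{2}\,\bigl\|\nabla \mathcal{L}(\mathbf{w})\bigr\|^{2} \;\ge\; \mu\,\mathcal{L}(\mathbf{w}),
  \qquad \forall\, \mathbf{w} \in S, \quad \mu > 0.
\]
% Tangent kernel of the non-linear system; a uniform lower bound on its
% smallest eigenvalue over S implies the mu-PL* condition above.
\begin{align*}
  K(\mathbf{w}) &= D\mathcal{F}(\mathbf{w})\, D\mathcal{F}(\mathbf{w})^{\top}
                   \;\in\; \mathbb{R}^{n \times n}, \\
  \tfrac{1}{2}\,\bigl\|\nabla \mathcal{L}(\mathbf{w})\bigr\|^{2}
    &= \tfrac{1}{2}\,\bigl(\mathcal{F}(\mathbf{w}) - \mathbf{y}\bigr)^{\top} K(\mathbf{w})\,
       \bigl(\mathcal{F}(\mathbf{w}) - \mathbf{y}\bigr)
     \;\ge\; \lambda_{\min}\!\bigl(K(\mathbf{w})\bigr)\, \mathcal{L}(\mathbf{w}).
\end{align*}
% Standard PL-type consequence: for beta-smooth L and step size eta = 1/beta,
% gradient descent converges linearly to a global minimum while it remains in S.
\[
  \mathbf{w}_{t+1} = \mathbf{w}_{t} - \eta\,\nabla\mathcal{L}(\mathbf{w}_{t}),
  \qquad
  \mathcal{L}(\mathbf{w}_{t}) \;\le\; (1 - \eta\,\mu)^{t}\, \mathcal{L}(\mathbf{w}_{0}).
\]
\end{document}
```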