Paper Title
Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials
Paper Authors
Paper Abstract
A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first-order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al., 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions, including sparse polynomials. Recent works have thus aimed to identify settings where gradient-based algorithms provably generalize better than the NTK. One such example is the "QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai and Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials. In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari and Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify "good" directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own. Finally, we construct a regularizer which encourages our parameter vector to move in the "good" directions, and show that gradient descent on the regularized loss will converge to a global minimizer, which also has low test error. This yields an end-to-end convergence and generalization guarantee with a provable sample complexity improvement over both the NTK and QuadNTK on their own.
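As a rough sketch of the decomposition the abstract refers to (the notation below is illustrative and not taken from the paper): writing a two-layer network f(x; W) as a Taylor expansion around its initialization W_0 with perturbation \Delta W,
\[
f(x;\, W_0 + \Delta W) \;\approx\; \underbrace{f(x;\, W_0) + \langle \nabla_W f(x;\, W_0),\, \Delta W\rangle}_{\text{NTK (first-order) part}} \;+\; \underbrace{\tfrac{1}{2}\,\nabla_W^2 f(x;\, W_0)[\Delta W, \Delta W]}_{\text{QuadNTK (second-order) part}},
\]
and the target functions considered have the form
\[
f^*(x) \;=\; \underbrace{p(x)}_{\text{dense, low-degree}} \;+\; \underbrace{q(x)}_{\text{sparse, high-degree}},
\]
with the first-order (NTK) part used to fit the dense low-degree component and the second-order (QuadNTK) part used to fit the sparse high-degree component, which neither part can do alone.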