Paper Title
On the Overlooked Structure of Stochastic Gradients
Paper Authors
Paper Abstract
Stochastic gradients closely relate to both the optimization and generalization of deep neural networks (DNNs). Some works attempted to explain the success of stochastic optimization for deep learning by the arguably heavy-tailed properties of gradient noise, while other works presented theoretical and empirical evidence against the heavy-tail hypothesis on gradient noise. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning remain under-explored. In this paper, we make two main contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. Our statistical tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and stochastic gradient noise caused by minibatch training usually do not exhibit power-law heavy tails. Second, we further discover that the covariance spectra of stochastic gradients have power-law structures overlooked by previous studies, and we present their theoretical implications for DNN training. While previous studies believed that the anisotropic structure of stochastic gradients matters to deep learning, they did not expect that the gradient covariance could have such an elegant mathematical structure. Our work challenges the existing belief and provides novel insights into the structure of stochastic gradients in deep learning.
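To make the two analyses in the abstract concrete, below is a minimal, illustrative sketch (not the paper's actual code or test procedure): it fits a power-law tail exponent to dimension-wise gradient magnitudes via log-log least squares plus a Kolmogorov-Smirnov check against a fitted Pareto tail, and it inspects the eigenspectrum of the gradient covariance for power-law decay. The array `grads` and all thresholds are hypothetical placeholders standing in for gradients collected from a real DNN.

```python
# Illustrative sketch only; assumes `grads` is an (n_samples, n_params) array
# of minibatch gradients collected from a DNN. Synthetic data is used here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_params = 2000, 512
# Placeholder data standing in for collected stochastic gradients.
grads = rng.standard_t(df=3, size=(n_samples, n_params))

# --- 1) Dimension-wise tail check --------------------------------------
# For one iteration, look at gradient entries across parameter dimensions
# and fit a power-law tail exponent via log-log least squares on the CCDF.
x = np.sort(np.abs(grads[0]))
tail = x[x > np.quantile(x, 0.9)]                  # keep the upper 10% as the tail
ccdf = 1.0 - np.arange(1, tail.size + 1) / tail.size
tail, ccdf = tail[:-1], ccdf[:-1]                  # drop the zero-CCDF point
slope, _ = np.polyfit(np.log(tail), np.log(ccdf), 1)
print(f"estimated tail exponent (dimension-wise): {-slope:.2f}")

# A Kolmogorov-Smirnov test against a fitted Pareto tail gives one rough
# formal check (the paper's actual statistical test may differ).
b, loc, scale = stats.pareto.fit(tail, floc=0)
ks_stat, p_value = stats.kstest(tail, 'pareto', args=(b, loc, scale))
print(f"KS test vs. fitted Pareto: stat={ks_stat:.3f}, p={p_value:.3f}")

# --- 2) Covariance spectrum power law -----------------------------------
# Eigenvalues of the gradient covariance, sorted in descending order; a
# power-law spectrum appears as a roughly straight line on log-log axes.
cov = np.cov(grads, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
k = np.arange(1, eigvals.size + 1)
spec_slope, _ = np.polyfit(np.log(k), np.log(eigvals), 1)
print(f"estimated spectral decay exponent: {-spec_slope:.2f}")
```

In practice one would repeat the tail check across iterations and parameter groups (and the spectral fit across layers and training stages), since the abstract's claim is about which of these views of the gradients do or do not exhibit power-law behavior.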