Paper Title
Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with Linear Widths
Paper Authors
Paper Abstract
Implicit deep learning has recently become popular in the machine learning community since these implicit models can achieve competitive performance with state-of-the-art deep networks while using significantly less memory and computation. However, our theoretical understanding of when and how first-order methods such as gradient descent (GD) converge on \textit{nonlinear} implicit networks is limited. Although this type of problem has been studied in standard feed-forward networks, the case of implicit models is still intriguing because implicit networks have \textit{infinitely} many layers, and the corresponding equilibrium equation may admit no solution or multiple solutions during training. This paper studies the convergence of both gradient flow (GF) and gradient descent for nonlinear ReLU-activated implicit networks. To address the well-posedness issue, we introduce a fixed scalar to scale the weight matrix of the implicit layer and show that there exists a small enough scaling constant that keeps the equilibrium equation well-posed throughout training. As a result, we prove that both GF and GD converge to a global minimum at a linear rate if the width $m$ of the implicit network is \textit{linear} in the sample size $N$, i.e., $m = \Omega(N)$.
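To make the setting concrete, the sketch below illustrates one plausible form of such a scaled implicit (equilibrium) layer, $z = \mathrm{ReLU}(\gamma A z + B x)$, solved by fixed-point iteration. The equilibrium form, the names A, B, and gamma, and the solver are illustrative assumptions rather than the paper's exact construction; the point is only that a small enough scaling constant gamma makes the update a contraction, so the equilibrium exists and is unique.

import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def implicit_layer(x, A, B, gamma, n_iter=100, tol=1e-8):
    # Solve the equilibrium equation z = ReLU(gamma * A z + B x) by
    # fixed-point iteration; since ReLU is 1-Lipschitz, the iteration
    # is a contraction whenever the spectral norm of gamma * A is below 1.
    z = np.zeros(A.shape[0])
    for _ in range(n_iter):
        z_next = relu(gamma * (A @ z) + B @ x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

# Toy usage with hypothetical width m and input dimension d.
rng = np.random.default_rng(0)
m, d = 64, 10
A = rng.standard_normal((m, m)) / np.sqrt(m)  # implicit-layer weight matrix
B = rng.standard_normal((m, d)) / np.sqrt(d)  # input weight matrix
gamma = 0.5 / np.linalg.norm(A, 2)            # scaling keeps ||gamma * A||_2 < 1
x = rng.standard_normal(d)
z_star = implicit_layer(x, A, B, gamma)       # equilibrium ("infinite-depth") feature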