Paper Title
Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima
Paper Authors
Paper Abstract
In centralized settings, it is well known that stochastic gradient descent (SGD) avoids saddle points and converges to local minima in nonconvex problems. However, similar guarantees are lacking for distributed first-order algorithms. The paper studies distributed stochastic gradient descent (D-SGD)--a simple network-based implementation of SGD. Conditions under which D-SGD avoids saddle points and converges to local minima are studied. First, we consider the problem of computing critical points. Assuming loss functions are nonconvex and possibly nonsmooth, it is shown that, for each fixed initialization, D-SGD converges to critical points of the loss with probability one. Next, we consider the problem of avoiding saddle points. In this case, we again assume that loss functions may be nonconvex and nonsmooth, but are smooth in a neighborhood of a saddle point. It is shown that, for any fixed initialization, D-SGD avoids such saddle points with probability one. Results are proved by studying the underlying (distributed) gradient flow, using the ordinary differential equation (ODE) method of stochastic approximation, and extending classical techniques from dynamical systems theory such as stable manifolds. Results are proved in the general context of subspace-constrained optimization, of which D-SGD is a special case.
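The abstract does not state the D-SGD iteration explicitly. As a point of reference only, a minimal sketch of the update commonly associated with D-SGD over a communication network is given below; the mixing weights w_{ij}, step sizes \alpha_k, and noise terms \xi_i^k are assumed notation for illustration, not symbols taken from the paper.

% Sketch of a standard consensus-plus-stochastic-gradient update (assumed form, not the paper's notation):
\[
  x_i^{k+1} \;=\; \sum_{j \in \mathcal{N}_i \cup \{i\}} w_{ij}\, x_j^{k}
  \;-\; \alpha_k \bigl( \nabla f_i(x_i^{k}) + \xi_i^{k} \bigr),
\]
where $x_i^{k}$ is agent $i$'s local iterate, $\mathcal{N}_i$ its set of neighbors in the communication graph, $w_{ij}$ nonnegative mixing weights, $f_i$ its local (possibly nonconvex) loss, $\alpha_k$ a decaying step size, and $\xi_i^{k}$ zero-mean gradient noise. In this sketch, each agent averages its neighbors' iterates and then takes a local noisy gradient step, which is one common way to realize the "network-based implementation of SGD" described in the abstract.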