Paper Title

Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning

Paper Authors

Anastasia Koloskova, Sebastian U. Stich, Martin Jaggi

Paper Abstract

We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers whose computation and communication frequencies vary over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return them to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum gradient delay $τ_{\max}$ and show that an $ε$-stationary point is reached after $\mathcal{O}\!\left(σ^2ε^{-2}+ τ_{\max}ε^{-1}\right)$ iterations, where $σ$ denotes the variance of the stochastic gradients. In this work (i) we obtain a tighter convergence rate of $\mathcal{O}\!\left(σ^2ε^{-2}+ \sqrt{τ_{\max}τ_{avg}}ε^{-1}\right)$ without any change in the algorithm, where $τ_{avg}$ is the average delay, which can be significantly smaller than $τ_{\max}$. We also provide (ii) a simple delay-adaptive learning rate scheme, under which asynchronous SGD achieves a convergence rate of $\mathcal{O}\!\left(σ^2ε^{-2}+ τ_{avg}ε^{-1}\right)$, and which requires neither extra hyperparameter tuning nor extra communication. Our results allow us to show for the first time that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions motivated by federated learning applications and improve the convergence rate by proving a weaker dependence on the maximum delay compared to prior works. In particular, we show that the heterogeneity term in the convergence rate is only affected by the average delay within each worker.
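To make the algorithmic setting concrete, here is a minimal single-process simulation of asynchronous SGD on a toy quadratic objective, with a delay-adaptive step size that damps stale updates. The $1/\text{delay}$ damping rule, the toy objective, and all variable names are illustrative assumptions for this sketch and are not taken from the paper's exact scheme.

```python
import numpy as np

# Toy simulation of asynchronous SGD on f(x) = 0.5 * ||x||^2, whose
# stochastic gradient at a point is that point plus Gaussian noise.
# Workers finish in random order, so the gradient applied at step t was
# computed at an older model snapshot (its staleness is the "delay").
# The delay-adaptive step size below is an illustrative choice, not the
# paper's exact rule.

rng = np.random.default_rng(0)

dim = 10
n_workers = 4
n_updates = 500
base_lr = 0.1
sigma = 0.1  # stochastic-gradient noise level

x = rng.normal(size=dim)                     # current server model
snapshots = [x.copy() for _ in range(n_workers)]  # model each worker last read
start_iter = [0] * n_workers                 # iteration at which it was read

for t in range(n_updates):
    # A random worker finishes and sends back its (possibly stale) gradient.
    w = int(rng.integers(n_workers))
    delay = t - start_iter[w]                # gradient delay tau_t
    grad = snapshots[w] + sigma * rng.normal(size=dim)

    # Delay-adaptive step size: full base rate for fresh gradients,
    # damped roughly like 1/delay for very stale ones (assumption).
    lr = base_lr / max(1, delay)

    x -= lr * grad                           # server applies the update

    # The worker reads the new model and starts computing its next gradient.
    snapshots[w] = x.copy()
    start_iter[w] = t + 1

print("final squared norm:", float(x @ x))
```

Running the loop with the damping removed (lr = base_lr) illustrates why stale gradients matter: updates computed at long-outdated snapshots can push the iterate in directions that no longer decrease the objective, which is exactly the effect the delay-adaptive rate is meant to control.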
