Paper Title

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

Paper Authors

Shigang Li, Tal Ben-Nun, Giorgi Nadiradze, Salvatore Di Girolamo, Nikoli Dryden, Dan Alistarh, Torsten Hoefler

Paper Abstract

Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
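
To make the subgroup weight exchange concrete, below is a minimal PyTorch sketch (an assumption on our part, not the authors' released code): after each local SGD step, a worker averages its model parameters only within a small process group via a group allreduce instead of a global one. The helper name `group_average_step`, the group size of 4, and the static group partition are illustrative; the actual WAGMA-SGD is wait-avoiding (non-blocking) and varies group composition, which this blocking sketch omits.

```python
# Illustrative sketch of intra-group model averaging via a group allreduce.
# Assumptions (not from the paper's code): PyTorch with torch.distributed,
# a fixed partition of ranks into groups of 4, and a blocking allreduce
# (WAGMA-SGD itself uses a wait-avoiding, non-blocking group allreduce).
import torch
import torch.distributed as dist

def group_average_step(model, optimizer, loss, group, group_size):
    """Run one local SGD step, then average the weights within the subgroup."""
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # local SGD update
    with torch.no_grad():
        for p in model.parameters():
            # Sum this parameter across the subgroup, then divide by its size.
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=group)
            p.data.div_(group_size)

# Hypothetical setup: partition world ranks into consecutive groups of 4.
# Note that dist.new_group must be called by every rank for every group.
# dist.init_process_group("nccl")
# rank, world = dist.get_rank(), dist.get_world_size()
# group_size = 4
# groups = [dist.new_group(list(range(s, s + group_size)))
#           for s in range(0, world, group_size)]
# my_group = groups[rank // group_size]
```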
