Paper Title
MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
Paper Authors
Paper Abstract
Existing general-purpose frameworks for gigantic model training, i.e., dense models with billions of parameters, cannot scale efficiently in cloud environments with varied networking conditions because of their large communication overheads. In this paper, we propose MiCS, which Minimizes the Communication Scale to bring down communication overhead. Specifically, by decreasing the number of participants in a communication collective, MiCS can exploit heterogeneous network bandwidth, reduce network traffic over slower links, lower communication latency to keep network bandwidth utilization high, and amortize the expensive global gradient synchronization. Our evaluation on AWS shows that the system throughput of MiCS is up to 2.89$\times$ that of state-of-the-art large-model training systems. MiCS achieves near-linear scaling efficiency, up to 1.27$\times$ that of DeepSpeed. MiCS allows us to train a proprietary model with 100 billion parameters on 512 GPUs with 99.4% weak-scaling efficiency, and it sustains over 54.5% of the theoretical computation power of each GPU on a public cloud whose instances have less GPU memory and more restricted networking than DGX-A100 clusters.
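The core idea named in the abstract, shrinking the number of participants in each communication collective, can be illustrated with a short sketch. The code below is a minimal illustration (not the MiCS implementation) using standard torch.distributed primitives: the global ranks are split into small partition groups, and the all-gather that reassembles a sharded tensor runs among `partition_size` ranks instead of across the entire cluster, so traffic stays on the faster intra-group links. The partition size, the `PARTITION_SIZE` environment variable, and the tensor shape are assumptions made for the example.

```python
# Minimal sketch of communication-scale reduction: perform collectives inside
# a small partition group rather than over the global process group.
# Illustrative only; group sizes, env vars, and shapes are assumptions.
import os
import torch
import torch.distributed as dist


def build_partition_groups(world_size: int, partition_size: int):
    """Split global ranks into disjoint partition groups.

    With world_size=16 and partition_size=8, ranks 0-7 and 8-15 each form a
    group, so all-gather/reduce-scatter traffic involves 8 participants
    instead of 16. new_group() is a collective, so every process must call
    it for every group in the same order.
    """
    groups = []
    for start in range(0, world_size, partition_size):
        ranks = list(range(start, min(start + partition_size, world_size)))
        groups.append(dist.new_group(ranks=ranks))
    return groups


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Hypothetical knob for the example; MiCS chooses partition sizes based
    # on the cluster's network hierarchy.
    partition_size = int(os.environ.get("PARTITION_SIZE", "8"))
    groups = build_partition_groups(world_size, partition_size)
    my_group = groups[rank // partition_size]

    # Each rank holds only a shard of a toy parameter tensor; gathering the
    # full tensor inside the small group is the reduced-scale collective.
    shard = torch.randn(1024, device="cuda")
    full = [torch.empty_like(shard) for _ in range(partition_size)]
    dist.all_gather(full, shard, group=my_group)


if __name__ == "__main__":
    main()
```

In this sketch each collective touches only `partition_size` GPUs, which is the communication-scale reduction the abstract credits for lower latency and less traffic over slow cross-group links; gradient synchronization across groups would still be needed, but less frequently, which is what the abstract refers to as amortizing the global synchronization.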