Paper Title
Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs
Authors
Abstract
Distributed Deep Learning (DDL) has rapidly grown in popularity because it boosts training performance on high-performance GPU clusters. Efficient job scheduling is indispensable for maximizing the overall performance of the cluster when multiple jobs are trained simultaneously. However, existing schedulers do not consider the contention among communication tasks from different distributed training jobs, which can degrade system performance and prolong job completion time. In this paper, we first establish a new DDL job scheduling framework that organizes DDL jobs as Directed Acyclic Graphs (DAGs) and considers communication contention between nodes. We then propose an efficient algorithm, LWF-$κ$, to balance GPU utilization and consolidate the GPUs allocated to each job. When scheduling the communication tasks, we observe that neither avoiding all contention nor blindly accepting it is optimal for minimizing the job completion time. We therefore propose a provable algorithm, AdaDUAL, to schedule those communication tasks efficiently. Based on AdaDUAL, we finally propose Ada-SRSF for the DDL job scheduling problem. Simulations on a 64-GPU cluster connected with 10 Gbps Ethernet show that LWF-$κ$ achieves up to a $1.59\times$ improvement over the classical first-fit algorithms. More importantly, Ada-SRSF reduces the average job completion time by $20.1\%$ and $36.7\%$ compared to the SRSF(1) scheme (avoiding all contention) and the SRSF(2) scheme (blindly accepting all two-way communication contention), respectively.
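The observation that neither always avoiding nor always accepting contention is optimal can be illustrated with a toy calculation. The sketch below is not the paper's contention model or the AdaDUAL algorithm. It assumes an idealized link that two overlapping transfers share equally with no extra overhead, and an in-flight transfer that cannot be preempted. The function names and the time units (seconds of transfer at full bandwidth) are hypothetical choices made for illustration.

```python
def avg_completion_wait(remaining: float, size: float) -> float:
    """Avoid contention: the new transfer waits for the in-flight one to finish.

    `remaining` is the time the in-flight transfer still needs at full
    bandwidth; `size` is the time the new transfer needs at full bandwidth.
    """
    first_done = remaining
    second_done = remaining + size
    return (first_done + second_done) / 2


def avg_completion_share(remaining: float, size: float) -> float:
    """Accept contention: both transfers run at once, each at half rate.

    Assumed model: perfectly fair sharing with no additional slowdown.
    """
    shorter, longer = sorted((remaining, size))
    shorter_done = 2 * shorter      # runs at half rate until it finishes
    longer_done = shorter + longer  # the link is never idle, so total work
    return (shorter_done + longer_done) / 2


if __name__ == "__main__":
    # A short transfer arrives while a long one is in flight.
    print(avg_completion_wait(10, 1), avg_completion_share(10, 1))
    # A long transfer arrives while a short one is almost done.
    print(avg_completion_wait(2, 10), avg_completion_share(2, 10))
```

Under these assumptions, sharing gives the lower average completion time in the first case and waiting gives the lower one in the second, so a fixed policy in either direction loses in some situations.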