Paper Title

BagPipe: Accelerating Deep Recommendation Model Training

Authors

Saurabh Agarwal, Chengpo Yan, Ziyi Zhang, Shivaram Venkataraman

Abstract

Deep learning based recommendation models (DLRM) are widely used in several business critical applications. Training such recommendation models efficiently is challenging because they contain billions of embedding-based parameters, leading to significant overheads from embedding access. By profiling existing systems for DLRM training, we observe that around 75\% of the iteration time is spent on embedding access and model synchronization. Our key insight in this paper is that embedding access has a specific structure which can be used to accelerate training. We observe that embedding accesses are heavily skewed, with around 1\% of embeddings representing more than 92\% of total accesses. Further, we observe that during offline training we can lookahead at future batches to determine exactly which embeddings will be needed at what iteration in the future. Based on these insights, we develop Bagpipe, a system for training deep recommendation models that uses caching and prefetching to overlap remote embedding accesses with the computation. We design an Oracle Cacher, a new component that uses a lookahead algorithm to generate optimal cache update decisions while providing strong consistency guarantees against staleness. We also design a logically replicated, physically partitioned cache and show that our design can reduce synchronization overheads in a distributed setting. Finally, we propose a disaggregated system architecture and show that our design can enable low-overhead fault tolerance. Our experiments using three datasets and four models show that Bagpipe provides a speed up of up to 5.6x compared to state of the art baselines, while providing the same convergence and reproducibility guarantees as synchronous training.
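To make the lookahead idea in the abstract concrete, the following is a minimal illustrative sketch (in Python, not code from the paper or Bagpipe's actual Oracle Cacher) of how visibility into future batches could drive cache prefetch and eviction decisions; the function and variable names below are hypothetical.

    # Minimal illustrative sketch: given the embedding IDs accessed by the next few
    # batches, decide which IDs to prefetch into a fixed-size cache and which cached
    # IDs can be evicted because they are not needed soon. All names are hypothetical.

    def plan_cache_updates(future_batches, cache, cache_size):
        """future_batches: list of sets of embedding IDs, one set per upcoming batch.
        cache: set of embedding IDs currently resident on the trainer.
        Returns (prefetch, evict) sets for the next iteration."""
        # Record the earliest upcoming iteration at which each embedding is needed.
        next_use = {}
        for t, batch in enumerate(future_batches):
            for emb_id in batch:
                next_use.setdefault(emb_id, t)

        needed_now = future_batches[0] if future_batches else set()
        prefetch = needed_now - cache

        # Belady-style eviction: drop the cached IDs whose next use is farthest
        # away (or absent) until the prefetched IDs fit in the cache budget.
        evict = set()
        overflow = len(cache) + len(prefetch) - cache_size
        if overflow > 0:
            candidates = sorted(
                cache - needed_now,
                key=lambda emb_id: next_use.get(emb_id, float("inf")),
                reverse=True,
            )
            evict = set(candidates[:overflow])
        return prefetch, evict

In a real system such decisions would be computed ahead of the training loop, so remote embedding fetches can overlap with computation on earlier batches, which is the overlap the abstract describes.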
