Paper Title

High-Quality Fault Resiliency in Fat Trees

Authors

John Gliksberg, Antoine Capra, Alexandre Louvet, Pedro Javier Garcia, Devan Sohier

Abstract

Coupling regular topologies with optimised routing algorithms is key in pushing the performance of interconnection networks of supercomputers. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalised Fat-Trees (PGFTs) which minimises congestion risk even under massive network degradation caused by equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula by relying on a fast preprocessing phase. This allows complete re-routing of networks with tens of thousands of nodes in less than a second. In turn, this greatly helps centralised fabric management react to faults with high-quality routing tables and no impact on running applications in current and future very large-scale HPC clusters.
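
To illustrate what a closed-form arithmetic routing formula can look like, the sketch below implements the classic destination-modulo rule on a plain k-ary tree. It is not the paper's Dmodc, whose formula and preprocessing phase are not given in the abstract. The function names, the `arity` parameter, and the convention that level 0 is the leaf switch level are all assumptions made for this example.

```python
def up_port(dest: int, level: int, arity: int) -> int:
    """Pick the up-port for destination `dest` at a switch on `level`.

    Assumed convention (not from the paper): level 0 is the leaf switch
    level, and every switch has `arity` up-ports. The result is the
    digit of `dest` at position `level` when written in base `arity`.
    """
    return (dest // arity ** level) % arity


def build_up_table(num_nodes: int, level: int, arity: int) -> list[int]:
    """Fill one switch's up-going forwarding table, one entry per destination."""
    return [up_port(d, level, arity) for d in range(num_nodes)]


if __name__ == "__main__":
    # Destinations 0..7 on a leaf-level switch with 4 up-ports.
    print(build_up_table(8, level=0, arity=4))
```

Each table entry is computed independently with one division and one modulo, with no path search, which is why formulas of this kind can regenerate tables for very large fabrics quickly. Handling failed links or switches, which is the focus of the paper, needs additional logic that this sketch does not attempt.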
