Exploring multi-dimensional hierarchical network topologies for efficient distributed training of trillion parameter DL models

Authors

William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna

Publication date

2021/9/24

Journal

arXiv preprint arXiv:2109.11762

Description

Deep Neural Networks (DNNs) have gained significant traction due to their wide applicability in different domains. DNN sizes and training samples are constantly growing, making training of such workloads more challenging. Distributed training is a solution to reduce the training time. High-performance distributed training platforms should leverage multi-dimensional hierarchical networks, which interconnect accelerators through different levels of the network, to dramatically reduce the number of expensive NICs required for the scale-out network. However, this comes at the cost of communication overhead between distributed accelerators, which must exchange gradients or input/output activations. To allow further scaling of the workloads, communication overhead needs to be minimized. In this paper, we argue that, in training platforms, adding more intermediate network dimensions helps mitigate the excessive use of expensive NIC resources. Further, we address different challenges of DNN training on hierarchical networks. We discuss how, when designing the interconnect, to distribute network bandwidth resources across different dimensions in order to (i) maximize BW utilization of all dimensions, and (ii) minimize the overall training time for the target workload. We then implement a framework that, for a given workload, determines the best network configuration that maximizes performance, or performance-per-cost.

Total citations

Cited by 3

Citations per year: 2022: 1; 2023: 2

Scholar articles

Exploring multi-dimensional hierarchical network topologies for efficient distributed training of trillion parameter DL models

W Won, S Rashidi, S Srinivasan, T Krishna - arXiv preprint arXiv:2109.11762, 2021
