Authors
Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna
Publication date
2020/8/23
Conference
2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Pages
81-92
Publisher
IEEE
Description
Modern Deep Learning systems heavily rely on distributed training over high-performance accelerator (e.g., TPU, GPU)-based hardware platforms. Examples today include Google's Cloud TPU and Facebook's Zion. DNN training involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, and the end-point accelerator. As innovation in AI/ML models continues to grow at an accelerated rate, there is a need for a comprehensive methodology to understand and navigate this complex SW/HW design-space for future systems to support efficient training of future DNN models. In this work, we make the following contributions: (i) establish the SW/HW design-space for Distributed Training over a hierarchical scale-up fabric, (ii) develop a network simulator for navigating the design-space, and (iii) demonstrate the …
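One axis of the design space the abstract names is the collective communication algorithm. As an illustrative sketch (not the simulator developed in the paper), the cost of a ring all-reduce can be estimated with the standard alpha-beta model; the function name and parameter values below are assumptions for the example.

```python
# Alpha-beta cost model for ring all-reduce: an analytical illustration
# of one point in the collective-communication design space.
def ring_allreduce_time(p, n_bytes, alpha, beta):
    """Estimate all-reduce time for n_bytes across p endpoints on a ring.

    alpha: per-message latency (seconds)
    beta:  per-byte transfer time (seconds/byte)

    A ring all-reduce runs 2*(p-1) steps (reduce-scatter + all-gather),
    each moving a chunk of n_bytes / p.
    """
    if p == 1:
        return 0.0  # no communication needed
    steps = 2 * (p - 1)
    return steps * (alpha + (n_bytes / p) * beta)

# Example: 8 accelerators exchanging a 100 MB gradient buffer over a
# hypothetical link with 5 us latency and 50 GB/s bandwidth.
t = ring_allreduce_time(8, 100e6, 5e-6, 1 / 50e9)
```

Such closed-form models capture only flat topologies; navigating hierarchical scale-up fabrics with overlapping compute and communication is precisely where a network simulator, as proposed in the paper, becomes necessary.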
Total citations
(yearly citation chart, 2020–2024; per-year counts not recoverable from the page)
Scholar articles
S Rashidi, S Sridharan, S Srinivasan, T Krishna - 2020 IEEE International Symposium on Performance …, 2020