Authors
Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna
Publication date
2020/8/23
Conference
2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Pages
81-92
Publisher
IEEE
Description
Modern Deep Learning systems heavily rely on distributed training over high-performance accelerator (e.g., TPU, GPU)-based hardware platforms. Examples today include Google's Cloud TPU and Facebook's Zion. DNN training involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, and the end-point accelerator. As innovation in AI/ML models continues to grow at an accelerated rate, there is a need for a comprehensive methodology to understand and navigate this complex SW/HW design-space for future systems to support efficient training of future DNN models. In this work, we make the following contributions: (i) establish the SW/HW design-space for Distributed Training over a hierarchical scale-up fabric, (ii) develop a network simulator for navigating the design-space, and (iii) demonstrate the …
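One axis of the design space the abstract names is the collective communication algorithm. As an illustrative sketch (not the simulator developed in the paper), the cost of a ring all-reduce can be estimated with the standard alpha-beta model; the function name and parameter values below are assumptions for the example.

```python
# Alpha-beta cost model for ring all-reduce: an analytical illustration
# of one point in the collective-communication design space.
def ring_allreduce_time(p, n_bytes, alpha, beta):
    """Estimate all-reduce time for n_bytes across p endpoints on a ring.

    alpha: per-message latency (seconds)
    beta:  per-byte transfer time (seconds/byte)

    A ring all-reduce runs 2*(p-1) steps (reduce-scatter + all-gather),
    each moving a chunk of n_bytes / p.
    """
    if p == 1:
        return 0.0  # no communication needed
    steps = 2 * (p - 1)
    return steps * (alpha + (n_bytes / p) * beta)

# Example: 8 accelerators exchanging a 100 MB gradient buffer over a
# hypothetical link with 5 us latency and 50 GB/s bandwidth.
t = ring_allreduce_time(8, 100e6, 5e-6, 1 / 50e9)
```

Such closed-form models capture only flat topologies; navigating hierarchical scale-up fabrics with overlapping compute and communication is precisely where a network simulator, as proposed in the paper, becomes necessary.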
Total citations
(yearly citation chart, 2020–2024; per-year counts not recoverable from the page)
Scholar articles
S Rashidi, S Sridharan, S Srinivasan, T Krishna - 2020 IEEE International Symposium on Performance …, 2020