Authors
William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna
Publication date
2023/4/23
Conference
2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Pages
283-294
Publisher
IEEE
Description
As deep learning models and input data continue to scale at an unprecedented rate, it has become inevitable to move towards distributed training platforms to fit the models and increase training throughput. State-of-the-art distributed training systems are adopting emerging approaches and techniques such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and optimized parallelization strategies. This results in a complex software/hardware co-design stack, necessitating a modeling/simulation infrastructure for design-space exploration. This paper introduces ASTRA-sim2.0, which extends the open-source ASTRA-sim infrastructure with capabilities to model state-of-the-art and emerging distributed training models and platforms. Specifically, we enable ASTRA-sim to (i) support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii …