ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

W Won, T Heo, S Rashidi, S Sridharan… - … Analysis of Systems …, 2023 - ieeexplore.ieee.org
As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …

Roar: A router microarchitecture for in-network allreduce

R Wang, D Dong, F Lei, J Ma, K Wu, K Lu - Proceedings of the 37th …, 2023 - dl.acm.org
The allreduce operation is the most commonly used collective operation in distributed or
parallel applications. It aggregates data collected from distributed hosts and broadcasts the …
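
To make the allreduce semantics described above concrete (this is only an illustrative sketch of the collective itself, not the router microarchitecture proposed in the paper), the following self-contained Python snippet shows the end result the operation must produce: every rank contributes a vector, and every rank receives the element-wise sum of all contributions, i.e. a reduce followed by a broadcast.

    # Illustrative sketch of allreduce(SUM) semantics, independent of any
    # particular implementation or interconnect.
    def allreduce_sum(per_rank_vectors):
        # Reduce: element-wise sum over all ranks' contributions.
        reduced = [sum(vals) for vals in zip(*per_rank_vectors)]
        # Broadcast: every rank gets its own copy of the reduced result.
        return [list(reduced) for _ in per_rank_vectors]

    # Example: 3 hosts, each holding a local vector of length 4.
    contributions = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    print(allreduce_sum(contributions))  # every rank sees [111, 222, 333, 444]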

Logical/physical topology-aware collective communication in deep learning training

S Cho, H Son, J Kim - 2023 IEEE International Symposium on …, 2023 - ieeexplore.ieee.org
Training is an essential step in deep learning before network models can be deployed.
To scale training, multiple GPUs are commonly used with data parallelism to exploit the …
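
As a minimal illustration of the data-parallel setup mentioned above (and not of the paper's topology-aware communication scheme), the sketch below assumes synchronous data parallelism: each worker computes a gradient on its own mini-batch shard, the gradients are averaged (the role an allreduce plays on a real multi-GPU system), and every worker then applies the identical weight update.

    import numpy as np

    # Minimal sketch of one synchronous data-parallel training step.
    def data_parallel_step(weights, local_gradients, lr=0.1):
        # Stand-in for allreduce(SUM) followed by division by world size.
        avg_grad = np.mean(local_gradients, axis=0)
        # Every worker applies the same update, so replicas stay in sync.
        return weights - lr * avg_grad

    weights = np.zeros(4)
    local_gradients = [np.array([1.0, 0.0, 2.0, 0.0]),   # gradient from GPU 0's shard
                       np.array([0.0, 1.0, 0.0, 2.0])]   # gradient from GPU 1's shard
    print(data_parallel_step(weights, local_gradients))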

vTrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training

J Bang, Y Choi, M Kim, Y Kim, M Rhu - arXiv preprint arXiv:2312.12391, 2023 - arxiv.org
As large language models (LLMs) become widespread in various application domains, a
critical challenge the AI community is facing is how to train these large AI models in a cost …
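
As a rough orientation for the cost question raised above (a generic back-of-envelope estimate, not vTrain's simulation model), the sketch below uses the common approximation that training a dense transformer costs about 6 x parameters x tokens FLOPs; the model size, token count, per-GPU throughput, and utilization values are illustrative assumptions.

    # Back-of-envelope training cost estimate. All numbers below are assumed
    # for illustration only; they are not measurements from the cited paper.
    params = 70e9              # assumed 70B-parameter model
    tokens = 1.4e12            # assumed 1.4T training tokens
    peak_flops = 312e12        # assumed per-GPU peak throughput (FLOP/s)
    utilization = 0.4          # assumed sustained model FLOPs utilization

    total_flops = 6 * params * tokens              # ~6 * N * D approximation
    gpu_seconds = total_flops / (peak_flops * utilization)
    print(f"~{total_flops:.2e} FLOPs, ~{gpu_seconds / 3600:.0f} GPU-hours")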

Impact of RoCE congestion control policies on distributed training of DNNs

T Khan, S Rashidi, S Sridharan… - … IEEE Symposium on …, 2022 - ieeexplore.ieee.org
RDMA over Converged Ethernet (RoCE) has gained significant attraction for
datacenter networks due to its compatibility with conventional Ethernet-based fabric …

Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM

C Block, G Gerogiannis, C Mendis, A Azad… - Proceedings of the 29th …, 2024 - dl.acm.org
Sparse matrix dense matrix multiplication (SpMM) is commonly used in applications ranging
from scientific computing to graph neural networks. Typically, when SpMM is executed in a …
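
To pin down the SpMM operation itself (only the local kernel, not the distributed communication scheme studied in the paper), here is a small self-contained sketch that multiplies a CSR-format sparse matrix by a dense matrix.

    # Self-contained CSR-format SpMM sketch: C = A_sparse @ B_dense.
    def spmm_csr(indptr, indices, data, B):
        n_rows, n_cols = len(indptr) - 1, len(B[0])
        C = [[0.0] * n_cols for _ in range(n_rows)]
        for i in range(n_rows):
            for k in range(indptr[i], indptr[i + 1]):   # nonzeros of row i
                a, j = data[k], indices[k]
                for c in range(n_cols):
                    C[i][c] += a * B[j][c]
        return C

    # A = [[2, 0], [0, 3]] in CSR form; B is a 2x2 dense matrix.
    print(spmm_csr([0, 1, 2], [0, 1], [2.0, 3.0], [[1.0, 2.0], [3.0, 4.0]]))
    # expected: [[2.0, 4.0], [9.0, 12.0]]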

PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices

SU Noh, J Hong, C Lim, S Park, J Kim, H Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent dual in-line memory modules (DIMMs) are starting to support processing-in-memory
(PIM) by associating their memory banks with processing elements (PEs), allowing …

COMET: A comprehensive cluster design methodology for distributed deep learning training

DK Kadiyala, S Rashidi, T Heo, AR Bambhaniya… - arXiv preprint arXiv …, 2022 - arxiv.org
Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of
specialized, high-end nodes to train. Designing such clusters to maximize both performance …

Enhancing Collective Communication in MCM Accelerators for Deep Learning Training

S Laskar, P Majhi, S Kim, F Mahmud… - … Symposium on High …, 2024 - ieeexplore.ieee.org
With the widespread adoption of Deep Learning (DL) models, the demand for DL
accelerator hardware has risen. On top of that, DL models are becoming massive in size. To …

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

W Won, M Elavazhagan, S Srinivasan, A Durg… - arXiv preprint arXiv …, 2023 - arxiv.org
The surge of artificial intelligence, specifically large language models, has led to a rapid push towards the development of large-scale machine learning training clusters …
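
For reference, a plain ring allreduce is the kind of fixed, topology-agnostic baseline that a topology-aware synthesizer would specialize; the sketch below (not the TACOS algorithm) simply enumerates its communication schedule: a reduce-scatter phase followed by an all-gather phase, 2*(N-1) steps in total, where at each step every rank forwards one chunk to its ring neighbor.

    # Baseline ring-allreduce schedule for n ranks, listed as (src, dst, chunk)
    # sends per step. Shown only as a simple, topology-agnostic reference point.
    def ring_allreduce_schedule(n):
        steps = []
        for phase in ("reduce-scatter", "all-gather"):
            for s in range(n - 1):
                sends = []
                for rank in range(n):
                    # Standard chunk indexing: in reduce-scatter, rank r sends
                    # chunk (r - s) mod n; in all-gather, chunk (r + 1 - s) mod n.
                    offset = s if phase == "reduce-scatter" else s - 1
                    chunk = (rank - offset) % n
                    sends.append((rank, (rank + 1) % n, chunk))
                steps.append((phase, sends))
        return steps

    for phase, sends in ring_allreduce_schedule(4):
        print(phase, sends)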