Exploring multi-dimensional hierarchical network topologies for efficient distributed training of trillion parameter DL models

W Won, S Rashidi, S Srinivasan, T Krishna - arXiv preprint arXiv …, 2021 - arxiv.org
Deep Neural Networks have attracted significant attention due to their wide applicability in
different domains. DNN sizes and training samples are constantly growing, making training …

FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

S Rashidi, W Won, S Srinivasan, P Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Distributed Deep Neural Network (DNN) training is a technique to reduce the training
overhead by distributing the training tasks across multiple accelerators, according to a …
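As a rough illustration of the data-parallel pattern this snippet alludes to (a minimal sketch, not code from FRED; worker count, model size, and the gradient stand-in are assumptions), each accelerator holds a weight replica, computes gradients on its own shard of the batch, and the gradients are averaged before a shared update:

```python
# Minimal data-parallel training step (illustrative sketch only).
import numpy as np

num_workers = 4          # assumed number of accelerators
model_size = 8           # assumed number of parameters
weights = np.zeros(model_size)

def local_gradient(worker_id, weights):
    # Stand-in for a real backward pass on this worker's data shard.
    rng = np.random.default_rng(worker_id)
    return rng.normal(size=weights.shape)

# Each worker computes its gradient independently...
grads = [local_gradient(w, weights) for w in range(num_workers)]
# ...then the gradients are averaged (the reduction/distribution step)
# so every replica applies the same update.
avg_grad = np.mean(grads, axis=0)
weights -= 0.01 * avg_grad
```

The averaging step is the reduction/distribution traffic that interconnects like the one proposed here are meant to carry efficiently.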

Themis: A network bandwidth-aware collective scheduling policy for distributed training of DL models

S Rashidi, W Won, S Srinivasan, S Sridharan… - Proceedings of the 49th …, 2022 - dl.acm.org
Distributed training is a solution to reduce DNN training time by splitting the task across
multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead …

ASTRA-sim: Enabling SW/HW co-design exploration for distributed DL training platforms

S Rashidi, S Sridharan, S Srinivasan… - … Analysis of Systems …, 2020 - ieeexplore.ieee.org
Modern Deep Learning systems heavily rely on distributed training over high-performance
accelerator (e.g., TPU, GPU)-based hardware platforms. Examples today include Google's …

DDLBench: towards a scalable benchmarking infrastructure for distributed deep learning

M Jansen, V Codreanu… - 2020 IEEE/ACM Fourth …, 2020 - ieeexplore.ieee.org
Due to its many applications across various fields of research, engineering, and daily life,
deep learning has seen a surge in popularity. Therefore, larger and more expressive models …

Parameter Box: High performance parameter servers for efficient distributed deep neural network training

L Luo, J Nelson, L Ceze, A Phanishayee… - arXiv preprint arXiv …, 2018 - arxiv.org
Most work in the deep learning systems community has focused on faster inference, but
arriving at a trained model requires lengthy experiments. Accelerating training lets …
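For readers unfamiliar with the parameter-server pattern this entry builds on, a minimal sketch (not Parameter Box itself; the class, names, and sizes are illustrative assumptions): workers pull the current weights, compute gradients locally, and push them back to a server that applies the update.

```python
# Minimal parameter-server push/pull loop (illustrative sketch only).
import numpy as np

class ParameterServer:
    def __init__(self, model_size, lr=0.01):
        self.weights = np.zeros(model_size)
        self.lr = lr

    def pull(self):
        # Workers fetch the latest weights before computing gradients.
        return self.weights.copy()

    def push(self, grad):
        # Apply each worker's gradient as it arrives (asynchronous-style update).
        self.weights -= self.lr * grad

server = ParameterServer(model_size=8)
for worker_id in range(4):                       # assumed 4 workers
    local_weights = server.pull()
    rng = np.random.default_rng(worker_id)
    grad = rng.normal(size=local_weights.shape)  # stand-in for a backward pass
    server.push(grad)
```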

LIBRA: Enabling Workload-Aware Multi-Dimensional Network Topology Optimization for Distributed Training of Large AI Models

W Won, S Rashidi, S Srinivasan… - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
As model sizes in machine learning continue to scale, distributed training is necessary to
accommodate model weights within each device and to reduce training time. However, this …

An allreduce algorithm and network co-design for large-scale training of distributed deep learning

TT Nguyen, M Wahib - 2021 IEEE/ACM 21st International …, 2021 - ieeexplore.ieee.org
Distributed training of Deep Neural Networks (DNNs) on High-Performance Computing
(HPC) systems is becoming increasingly common. HPC systems dedicated entirely or …
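The allreduce collective named in the title can be sketched as the standard ring algorithm (a generic illustration, not the co-designed algorithm from this paper): a reduce-scatter phase followed by an all-gather, each taking P-1 steps around the ring.

```python
# Generic ring all-reduce sketch (illustrative only). Each of P workers starts
# with P chunks; after reduce-scatter and all-gather, every worker holds the
# element-wise sum of every chunk.
import numpy as np

def ring_allreduce(chunks):
    """chunks[w][c] is chunk c held by worker w (one NumPy array per chunk)."""
    P = len(chunks)
    # Reduce-scatter: after P-1 steps, worker w holds the fully reduced
    # chunk (w + 1) % P.
    for step in range(P - 1):
        sends = [(w, (w - step) % P, chunks[w][(w - step) % P].copy())
                 for w in range(P)]
        for src, c, payload in sends:
            chunks[(src + 1) % P][c] += payload
    # All-gather: circulate the reduced chunks so every worker has all of them.
    for step in range(P - 1):
        sends = [(w, (w + 1 - step) % P, chunks[w][(w + 1 - step) % P].copy())
                 for w in range(P)]
        for src, c, payload in sends:
            chunks[(src + 1) % P][c] = payload
    return chunks

# Example: 4 workers, each contributing the value (w + 1) in every chunk element.
P, chunk_len = 4, 2
data = [[np.full(chunk_len, float(w + 1)) for _ in range(P)] for w in range(P)]
ring_allreduce(data)
# Every chunk on every worker now equals 1 + 2 + 3 + 4 = 10.
```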

COMET: A comprehensive cluster design methodology for distributed deep learning training

DK Kadiyala, S Rashidi, T Heo, AR Bambhaniya… - arXiv preprint arXiv …, 2022 - arxiv.org
Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of
specialized, high-end nodes to train. Designing such clusters to maximize both performance …

EFLOPS: Algorithm and system co-design for a high performance distributed training platform

J Dong, Z Cao, T Zhang, J Ye, S Wang… - … Symposium on High …, 2020 - ieeexplore.ieee.org
Deep neural networks (DNNs) have attracted tremendous attention as compelling solutions
for applications such as image classification, object detection, speech recognition, and so …