ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

W Won, T Heo, S Rashidi, S Sridharan… - … Analysis of Systems …, 2023 - ieeexplore.ieee.org
As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …

Enabling compute-communication overlap in distributed deep learning training platforms

S Rashidi, M Denton, S Sridharan… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators
(e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth …

Themis: A network bandwidth-aware collective scheduling policy for distributed training of dl models

S Rashidi, W Won, S Srinivasan, S Sridharan… - Proceedings of the 49th …, 2022 - dl.acm.org
Distributed training is a solution to reduce DNN training time by splitting the task across
multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead …

Impact of RoCE congestion control policies on distributed training of DNNs

T Khan, S Rashidi, S Sridharan… - … IEEE Symposium on …, 2022 - ieeexplore.ieee.org
RDMA over Converged Ethernet (RoCE) has gained significant traction for
datacenter networks due to its compatibility with conventional Ethernet-based fabric …

Analysis of distributed deep learning in the cloud

A Sharma, VM Bhasi, S Singh, R Jain… - arXiv preprint arXiv …, 2022 - arxiv.org
We aim to resolve this problem by introducing a comprehensive distributed deep learning
(DDL) profiler, which can determine the various execution "stalls" that DDL suffers from while …

Co-designing the topology/algorithm to accelerate distributed training

X Hou, R Xu, S Ma, Q Wang, W Jiang… - 2021 IEEE Intl Conf on …, 2021 - ieeexplore.ieee.org
With the development of Deep Learning (DL), Deep Neural Network (DNN) models have
become more complex. At the same time, the development of the Internet makes it easy to …

GPU Cluster Scheduling for Network-Sensitive Deep Learning

A Sharma, VM Bhasi, S Singh, G Kesidis… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables
proximity-based consolidation of GPU resources based on the DDL jobs' sensitivities to the …

FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

S Rashidi, W Won, S Srinivasan, P Gupta… - arXiv preprint arXiv …, 2024 - arxiv.org
Distributed Deep Neural Network (DNN) training is a technique to reduce the training
overhead by distributing the training tasks across multiple accelerators, according to a …

Optimizing the Parallelism of Communication and Computation in Distributed Training Platform

X Hou, Y Yuan, S Ma, R Xu, B Wang, T Li… - … on Algorithms and …, 2023 - Springer
With the development of deep learning, DNN models have become more complex. Large-
scale model parameters enhance the level of AI by improving the accuracy of DNN models …

Research on the transmission performance of QUIC in data center network

Z Xing, H Qi, L Cong, X Di… - … Conference on Electronic …, 2021 - ieeexplore.ieee.org
With the continuous development of emerging technologies such as big data, network
application scenarios are becoming more abundant, which makes the network traffic grow …