ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

W Won, T Heo, S Rashidi, S Sridharan… - … Analysis of Systems …, 2023 - ieeexplore.ieee.org
As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …

Data center Ethernet and remote direct memory access: Issues at hyperscale

T Hoefler, D Roweth, K Underwood, R Alverson… - Computer, 2023 - ieeexplore.ieee.org
Remote direct memory access (RDMA) over converged Ethernet (RoCE) was an attempt to
adopt modern RDMA features into existing Ethernet installations. We revisit RoCE's design …

Datacenter Ethernet and RDMA: Issues at hyperscale

T Hoefler, D Roweth, K Underwood, B Alverson… - arXiv preprint arXiv …, 2023 - arxiv.org
We observe that emerging artificial intelligence, high-performance computing, and storage
workloads pose new challenges for large-scale datacenter networking. RDMA over …

Practical Packet Deflection in Datacenters

S Abdous, E Sharafzadeh, S Ghorbani - Proceedings of the ACM on …, 2023 - dl.acm.org
Bursts, sudden surges in network utilization, are a significant root cause of packet loss and
high latency in datacenters. Packet deflection, re-routing packets that arrive at a local …

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

W Won, M Elavazhagan, S Srinivasan, A Durg… - arXiv preprint arXiv …, 2023 - arxiv.org
The surge of artificial intelligence, specifically large language models, has led to a rapid
advent towards the development of large-scale machine learning training clusters …

On the Burstiness of Distributed Machine Learning Traffic

N Luangsomboon, F Fazel, J Liebeherr… - arXiv preprint arXiv …, 2023 - arxiv.org
Traffic from distributed training of machine learning (ML) models makes up a large and
growing fraction of the traffic mix in enterprise data centers. While work on distributed ML …

GPU Cluster Scheduling for Network-Sensitive Deep Learning

A Sharma, VM Bhasi, S Singh, G Kesidis… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables
proximity based consolidation of GPU resources based on the DDL jobs' sensitivities to the …

FG-PFC: A Fine-Grained PFC Mechanism for Lossless RDMA

S Li, C Wang, Y Zhang, C Ma, L Li… - Journal of Physics …, 2023 - iopscience.iop.org
Remote Direct Memory Access (RDMA) is widely deployed in data centers to
improve the performance, efficiency, and reliability of data centers. Priority-based Flow …