Rethinking machine learning collective communication as a multi-commodity flow problem

X Liu, B Arzani, SKR Kakarla, L Zhao, V Liu… - Proceedings of the …, 2024 - dl.acm.org
Cloud operators utilize collective communication optimizers to enhance the efficiency of the
single-tenant, centrally managed training clusters they operate. However, current optimizers …

Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions

V Addanki, M Pacut, S Schmid - 21st USENIX Symposium on Networked …, 2024 - usenix.org
Packet buffers in datacenter switches are shared across all the switch ports in order to
improve the overall throughput. The trend of shrinking buffer sizes in datacenter switches …
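
For context, a minimal sketch of the classic Dynamic Threshold admission rule that shared-buffer switches commonly implement; Credence augments this style of admission decision with ML predictions, and the function name and parameters below are illustrative assumptions rather than the paper's algorithm.

```python
# Sketch of Dynamic Threshold (DT) buffer sharing, assuming a single shared buffer
# pool and a per-port admission check; parameter names are illustrative.

def dt_admit(queue_len: int, buffer_size: int, occupied: int, alpha: float = 1.0) -> bool:
    """Admit a packet if the port's queue is below the dynamic threshold.

    The threshold scales with the remaining free buffer, so each port's allowance
    grows when the shared buffer is empty and shrinks as other ports fill it.
    """
    threshold = alpha * (buffer_size - occupied)  # free buffer sets the per-port limit
    return queue_len < threshold


# Example: a nearly full 12 MB shared buffer leaves little headroom for a busy port.
print(dt_admit(queue_len=400_000, buffer_size=12_000_000, occupied=11_800_000))  # False
```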

Efficient all-to-all collective communication schedules for direct-connect topologies

P Basu, L Zhao, J Fantl, S Pal… - Proceedings of the 33rd …, 2024 - dl.acm.org
The all-to-all collective communications primitive is widely used in machine learning (ML)
and high performance computing (HPC) workloads, and optimizing its performance is of …
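
For orientation, a minimal sketch of the baseline linear-shift all-to-all schedule that such work optimizes over: each node sends a distinct chunk to one peer per round. The node count and helper name here are illustrative assumptions, not the paper's direct-connect schedules.

```python
# Baseline all-to-all: in round r, node i sends the chunk destined for node (i + r) % N,
# so after N - 1 rounds every ordered pair of nodes has exchanged exactly once.

def all_to_all_rounds(num_nodes: int):
    """Yield, for each round, the list of (sender, receiver) pairs."""
    for r in range(1, num_nodes):
        yield [(i, (i + r) % num_nodes) for i in range(num_nodes)]


for r, pairs in enumerate(all_to_all_rounds(4), start=1):
    print(f"round {r}: {pairs}")
```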

Tacos: Topology-aware collective algorithm synthesizer for distributed machine learning

W Won, M Elavazhagan, S Srinivasan… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
The surge of artificial intelligence, particularly large language models, has driven the rapid
development of large-scale machine learning clusters. Executing distributed models on …

Challenging the need for packet spraying in large-scale distributed training

V Addanki, P Goyal, I Marinos - arXiv preprint arXiv:2407.00550, 2024 - arxiv.org
Large-scale distributed training in production datacenters constitutes a challenging
workload bottlenecked by network communication. In response, both major industry players …

Jasper: Scalable and Fair Multicast for Financial Exchanges in the Cloud

M Haseeb, J Geng, U Butler, X Hao… - arXiv preprint arXiv …, 2024 - arxiv.org
Financial exchanges have recently shown an interest in migrating to the public cloud for
scalability, elasticity, and cost savings. However, financial exchanges often have strict …

Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

B Arzani, SKR Kakarla, M Castro, S Kandula… - arXiv preprint arXiv …, 2023 - arxiv.org
We show that recent work on communication schedulers proposed for ML collectives does not
scale to the increasing problem sizes that arise from training larger models. These works …
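
To make the multi-commodity flow view concrete, here is a minimal sketch assuming a tiny 3-GPU bidirectional ring and two point-to-point demands, solved as a linear program that minimizes completion time. The paper's actual formulation is richer (chunking, store-and-forward epochs), so the variable layout, objective, and example topology below are illustrative assumptions only.

```python
# Sketch: route collective traffic demands as commodities over a small directed ring,
# minimizing the completion time T subject to flow conservation and link capacity.
import numpy as np
from scipy.optimize import linprog

nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (2, 0), (1, 0), (2, 1), (0, 2)]   # directed ring links
capacity = {e: 1.0 for e in edges}                          # bytes per unit time
commodities = [(0, 2, 1.0), (1, 0, 1.0)]                    # (src, dst, bytes to move)

# Variable layout: one flow value per (commodity, edge), then the completion time T.
n_flow = len(commodities) * len(edges)
T = n_flow


def var(k, e):
    """Column index of the flow variable for commodity k on edge e."""
    return k * len(edges) + edges.index(e)


# Flow conservation: net outflow is +demand at the source, -demand at the sink, 0 elsewhere.
A_eq, b_eq = [], []
for k, (src, dst, demand) in enumerate(commodities):
    for v in nodes:
        row = np.zeros(n_flow + 1)
        for e in edges:
            if e[0] == v:
                row[var(k, e)] += 1.0
            if e[1] == v:
                row[var(k, e)] -= 1.0
        A_eq.append(row)
        b_eq.append(demand if v == src else -demand if v == dst else 0.0)

# Capacity: the total flow crossing a link must fit within capacity * T.
A_ub, b_ub = [], []
for e in edges:
    row = np.zeros(n_flow + 1)
    for k in range(len(commodities)):
        row[var(k, e)] = 1.0
    row[T] = -capacity[e]
    A_ub.append(row)
    b_ub.append(0.0)

c = np.zeros(n_flow + 1)
c[T] = 1.0                                                   # minimize completion time
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (n_flow + 1))
print("optimal completion time:", res.fun)                   # 1.0 for this toy instance
```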

Optimizing Distributed Workloads With Infrastructure-Managed Communication and Deployment

Y Wu - 2024 - search.proquest.com
As the scale and complexity of distributed workloads grow, performance is no longer the
sole objective sought by application developers and infrastructure operators, as they …