Crux: GPU-Efficient Communication Scheduling for Deep Learning Training

J Cao, Y Guan, K Qian, J Gao, W Xiao, J Dong… - Proceedings of the …, 2024 - dl.acm.org
Deep learning training (DLT), e.g., large language model (LLM) training, has become one of
the most important services in multi-tenant cloud computing. By deeply studying in …

MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning

S Rajasekaran, S Narang, AA Zabreyko… - Proceedings of the 23rd …, 2024 - dl.acm.org
This paper argues that congestion control protocols in machine learning datacenters sit at a
sweet spot between centralized and distributed flow scheduling solutions. We present …

Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

J Li, S Tripathi, L Rastogi, Y Lei, R Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
As machine learning models scale in size and complexity, their computational requirements
become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by …

Dynamic Flow Scheduling for DNN Training Workloads in Data Centers

X Zhao, C Wu, X Zhu - IEEE Transactions on Network and …, 2024 - ieeexplore.ieee.org
Distributed deep learning (DL) training constitutes a significant portion of workloads in
modern data centers that are equipped with high computational capacities, such as GPU …

MLTCP: Congestion Control for DNN Training

S Rajasekaran, S Narang, AA Zabreyko… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MLTCP, a technique to augment today's congestion control algorithms to
accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication …

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

Y Wei, T Hu, C Liang, Y Cui - arXiv preprint arXiv:2403.07585, 2024 - arxiv.org
The past few years have witnessed the flourishing of large-scale deep neural network
models with ever-growing parameter numbers. Training such large-scale models typically …