Crux: GPU-Efficient Communication Scheduling for Deep Learning Training

J Cao, Y Guan, K Qian, J Gao, W Xiao, J Dong… - Proceedings of the …, 2024 - dl.acm.org
Deep learning training (DLT), e.g., large language model (LLM) training, has become one of
the most important services in multi-tenant cloud computing. By deeply studying in …

MLTCP: A Distributed Technique to Approximate Centralized Flow Scheduling For Machine Learning

S Rajasekaran, S Narang, AA Zabreyko… - Proceedings of the 23rd …, 2024 - dl.acm.org
This paper argues that congestion control protocols in machine learning datacenters sit at a
sweet spot between centralized and distributed flow scheduling solutions. We present …

Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

J Li, S Tripathi, L Rastogi, Y Lei, R Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
As machine learning models scale in size and complexity, their computational requirements
become a significant barrier. Mixture-of-Experts (MoE) models alleviate this issue by …

Dynamic Flow Scheduling for DNN Training Workloads in Data Centers

X Zhao, C Wu, X Zhu - IEEE Transactions on Network and …, 2024 - ieeexplore.ieee.org
Distributed deep learning (DL) training constitutes a significant portion of workloads in
modern data centers that are equipped with high computational capacities, such as GPU …

MLTCP: Congestion Control for DNN Training

S Rajasekaran, S Narang, AA Zabreyko… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MLTCP, a technique to augment today's congestion control algorithms to
accelerate DNN training jobs in shared GPU clusters. MLTCP enables the communication …

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

Y Wei, T Hu, C Liang, Y Cui - arXiv preprint arXiv:2403.07585, 2024 - arxiv.org
The past few years have witnessed the flourishing of large-scale deep neural network
models with ever-growing parameter numbers. Training such large-scale models typically …