PyTorch FSDP: experiences on scaling fully sharded data parallel

Y Zhao, A Gu, R Varma, L Luo, CC Huang, M Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
It is widely acknowledged that large models have the potential to deliver superior
performance across a broad range of domains. Despite the remarkable progress made in …
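FSDP ships in PyTorch as a module wrapper that shards parameters, gradients, and optimizer state across data-parallel ranks. A minimal usage sketch, assuming torch >= 1.12, one GPU per rank with the NCCL backend, and a launch via torchrun; the toy model and hyperparameters are placeholders:

```python
# Minimal FSDP sketch: assumes torch >= 1.12, one GPU per rank, NCCL backend,
# and a launch via torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy placeholder model; real workloads wrap transformer blocks.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # Wrapping shards parameters, gradients, and optimizer state across ranks;
    # full parameters are materialized (all-gathered) only around each unit's
    # forward and backward.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    model(x).sum().backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run with, for example, torchrun --nproc_per_node=8 fsdp_demo.py (file name hypothetical); the sketch leaves FSDP's sharding and communication-scheduling options at their defaults.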

On optimizing the communication of model parallelism

Y Zhuang, L Zheng, Z Li, E Xing, Q Ho… - Proceedings of …, 2023 - proceedings.mlsys.org
We study a novel and important communication pattern in large-scale model-parallel deep
learning (DL), which we call cross-mesh resharding. This pattern emerges when the two …
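To make the pattern concrete: when a tensor sharded over one device mesh must be consumed by a stage that shards it differently, each (sender, receiver) pair exchanges the block where their index ranges intersect. A toy sketch with hypothetical layouts (row-sharded over 2 devices on the sending mesh, column-sharded over 4 on the receiving mesh), not the paper's implementation:

```python
# Cross-mesh resharding reduced to index arithmetic (hypothetical layouts):
# the sending stage shards an (8, 8) tensor by rows over 2 devices, the
# receiving stage expects it sharded by columns over 4 devices.
R, C = 8, 8
src_rows = [(0, 4), (4, 8)]                  # row range owned by each sender
dst_cols = [(0, 2), (2, 4), (4, 6), (6, 8)]  # column range owned by each receiver

transfers = []
for s, (r0, r1) in enumerate(src_rows):
    for d, (c0, c1) in enumerate(dst_cols):
        # Sender s holds rows r0:r1 of every column, so it owns exactly the
        # block rows r0:r1 x cols c0:c1 of receiver d's shard.
        transfers.append((s, d, (r0, r1), (c0, c1)))

for s, d, rows, cols in transfers:
    print(f"sender {s} -> receiver {d}: rows {rows}, cols {cols}")
```

Even this tiny example yields 8 point-to-point transfers; scheduling and routing such transfers efficiently over the slower inter-mesh links is, roughly, the problem the paper targets.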

Horus: Interference-aware and prediction-based scheduling in deep learning systems

G Yeung, D Borowiec, R Yang, A Friday… - … on Parallel and …, 2021 - ieeexplore.ieee.org
To accelerate the training of Deep Learning (DL) models, clusters of machines equipped
with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of …

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

F Liang, Z Zhang, H Lu, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid growth in the volume of data sets, models, and devices in the domain of deep
learning, there is increasing attention on large-scale distributed deep learning. In contrast to …

DeAR: Accelerating distributed deep learning with fine-grained all-reduce pipelining

L Zhang, S Shi, X Chu, W Wang, B Li… - 2023 IEEE 43rd …, 2023 - ieeexplore.ieee.org
Communication scheduling has been shown to be effective in accelerating distributed
training, which enables all-reduce communications to be overlapped with backpropagation …
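The baseline being improved on is the overlap of gradient all-reduce with the remainder of the backward pass. A rough sketch of that overlap using per-parameter autograd hooks and async collectives; it is not DeAR's finer-grained reduce-scatter/all-gather pipeline, and it assumes a fresh backward pass (no gradient accumulation) and an already initialized process group:

```python
# Overlap gradient all-reduce with backpropagation via per-parameter hooks and
# async collectives. Generic overlap idea only, not DeAR's decomposition.
import torch
import torch.distributed as dist

def backward_with_overlap(loss, params):
    pending = []

    def hook(grad):
        # Fires as soon as this parameter's gradient is produced; the collective
        # runs on the communication stream while autograd keeps working on
        # earlier layers.
        pending.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True))
        return grad

    handles = [p.register_hook(hook) for p in params if p.requires_grad]
    loss.backward()
    for work in pending:          # wait for all in-flight reductions
        work.wait()
    for h in handles:
        h.remove()
    for p in params:              # average rather than sum
        if p.grad is not None:
            p.grad.div_(dist.get_world_size())

if __name__ == "__main__":
    # Single-process demo on CPU with the gloo backend.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)
    model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                                torch.nn.Linear(32, 1))
    loss = model(torch.randn(4, 16)).sum()
    backward_with_overlap(loss, list(model.parameters()))
    dist.destroy_process_group()
```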

DC2: Delay-aware compression control for distributed machine learning

AM Abdelmoniem, M Canini - IEEE INFOCOM 2021-IEEE …, 2021 - ieeexplore.ieee.org
Distributed training performs data-parallel training of DNN models which is a necessity for
increasingly complex models and large datasets. Recent works are identifying major …
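As an illustration of the compression being controlled, top-k gradient sparsification is the kind of compressor whose ratio a delay-aware controller could tune against measured network conditions; the controller itself is not reproduced here, and the fixed ratio below is a placeholder:

```python
# Top-k gradient sparsification sketch; `ratio` is a fixed placeholder rather
# than a value chosen by a delay-aware controller.
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    flat = grad.reshape(-1)
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)   # largest-magnitude entries
    return flat[indices], indices, grad.shape

def topk_decompress(values, indices, shape):
    flat = torch.zeros(shape.numel(), dtype=values.dtype, device=values.device)
    flat[indices] = values
    return flat.reshape(shape)

g = torch.randn(4, 256)
values, indices, shape = topk_compress(g, ratio=0.05)
g_hat = topk_decompress(values, indices, shape)
print(f"kept {values.numel()} of {g.numel()} entries")
```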

Task placement and resource allocation for edge machine learning: A GNN-based multi-agent reinforcement learning paradigm

Y Li, X Zhang, T Zeng, J Duan, C Wu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Machine learning (ML) tasks are one of the major workloads in today's edge computing
networks. Existing edge-cloud schedulers allocate the requested amounts of resources to …

Synthesizing optimal collective algorithms

Z Cai, Z Liu, S Maleki, M Musuvathi… - Proceedings of the 26th …, 2021 - dl.acm.org
Collective communication algorithms are an important component of distributed
computation. Indeed, in the case of deep-learning, collective communication is the Amdahl's …
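For reference, the object being synthesized is a schedule of chunked sends between ranks. The textbook ring all-reduce below, simulated on in-memory buffers (the classic baseline, not an SCCL-synthesized schedule), shows what such a schedule specifies:

```python
# Textbook ring all-reduce simulated on in-memory buffers: which chunk each
# rank sends to its neighbour at each step.
def ring_allreduce(buffers):
    n = len(buffers)                              # one buffer of n chunks per rank
    chunks = [[list(c) for c in b] for b in buffers]

    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the full sum of
    # chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, chunks[r][(r - step) % n]) for r in range(n)]
        for src, c, data in msgs:
            dst = (src + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], data)]

    # Phase 2: all-gather. The completed chunks circulate around the ring.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n]) for r in range(n)]
        for src, c, data in msgs:
            chunks[(src + 1) % n][c] = list(data)
    return chunks

# 4 ranks, each contributing the value of its rank in every element.
result = ring_allreduce([[[r, r] for _ in range(4)] for r in range(4)])
assert all(chunk == [6, 6] for rank in result for chunk in rank)  # 0+1+2+3 = 6
```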

TapFinger: Task placement and fine-grained resource allocation for edge machine learning

Y Li, T Zeng, X Zhang, J Duan… - IEEE INFOCOM 2023 …, 2023 - ieeexplore.ieee.org
Machine learning (ML) tasks are one of the major workloads in today's edge computing
networks. Existing edge-cloud schedulers allocate the requested amounts of resources to …

MSCCLang: Microsoft collective communication language

M Cowan, S Maleki, M Musuvathi, O Saarikivi… - Proceedings of the 28th …, 2023 - dl.acm.org
Machine learning models with millions or billions of parameters are increasingly trained and
served on large multi-GPU systems. As models grow in size and execute on more GPUs …