Software-hardware co-design for fast and scalable training of deep learning recommendation models

D Mudigere, Y Hao, J Huang, Z Jia, A Tulloch… - Proceedings of the 49th …, 2022 - dl.acm.org
Deep learning recommendation models (DLRMs) have been used across many business-
critical services at Meta and are the single largest AI application in terms of infrastructure …

On optimizing the communication of model parallelism

Y Zhuang, L Zheng, Z Li, E Xing, Q Ho… - Proceedings of …, 2023 - proceedings.mlsys.org
We study a novel and important communication pattern in large-scale model-parallel deep
learning (DL), which we call cross-mesh resharding. This pattern emerges when the two …

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

F Liang, Z Zhang, H Lu, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid growth in the volume of data sets, models, and devices in the domain of deep
learning, there is increasing attention on large-scale distributed deep learning. In contrast to …

Themis: A network bandwidth-aware collective scheduling policy for distributed training of DL models

S Rashidi, W Won, S Srinivasan, S Sridharan… - Proceedings of the 49th …, 2022 - dl.acm.org
Distributed training is a solution to reduce DNN training time by splitting the task across
multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead …

Robust searching-based gradient collaborative management in intelligent transportation system

H Shi, H Wang, R Ma, Y Hua, T Song, H Gao… - ACM Transactions on …, 2023 - dl.acm.org
With the rapid development of big data and the Internet of Things (IoT), traffic data from an
Intelligent Transportation System (ITS) is becoming more and more accessible. To …

MSCCLang: Microsoft collective communication language

M Cowan, S Maleki, M Musuvathi, O Saarikivi… - Proceedings of the 28th …, 2023 - dl.acm.org
Machine learning models with millions or billions of parameters are increasingly trained and
served on large multi-GPU systems. As models grow in size and execute on more GPUs …

Swing: Short-cutting Rings for Higher Bandwidth Allreduce

D De Sensi, T Bonato, D Saam, T Hoefler - 21st USENIX Symposium on …, 2024 - usenix.org
The allreduce collective operation accounts for a significant fraction of the runtime of
workloads running on distributed systems. One factor determining its performance is the …
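
As general background for this allreduce entry, the following is a minimal sketch of the cost model of the standard ring allreduce, not the Swing algorithm the cited paper proposes; the function name and the parameters (node count p, message size n, link bandwidth, per-step latency alpha) are illustrative assumptions.

    def ring_allreduce_time(n: float, p: int, bandwidth: float, alpha: float) -> float:
        """Rough runtime estimate (seconds) of a standard ring allreduce.

        The ring algorithm runs 2*(p-1) steps (a reduce-scatter followed by
        an allgather), each moving n/p bytes per link, so the bandwidth term
        approaches 2*n/bandwidth for large p while the latency term grows
        linearly with p.
        """
        steps = 2 * (p - 1)
        return steps * (alpha + (n / p) / bandwidth)

    # Example: 1 GiB of gradients, 8 nodes, 25 GB/s links, 5 us per step.
    print(ring_allreduce_time(n=2**30, p=8, bandwidth=25e9, alpha=5e-6))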

Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning

N Xie, T Norman, D Grewe… - Proceedings of Machine …, 2022 - proceedings.mlsys.org
We present a novel characterization of the mapping of multiple parallelism forms (e.g., data
and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and …

xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning

A Weingram, Y Li, H Qi, D Ng, L Dai, X Lu - Journal of Computer Science …, 2023 - Springer
Machine learning techniques have become ubiquitous in both industry and academic
applications. Increasing model sizes and training data volumes necessitate fast …

OSDP: Optimal sharded data parallel for distributed deep learning

Y Jiang, F Fu, X Miao, X Nie, B Cui - arXiv preprint arXiv:2209.13258, 2022 - arxiv.org
Large-scale deep learning models contribute to significant performance improvements on
a variety of downstream tasks. Current data and model parallelism approaches utilize model …