Software-hardware co-design for fast and scalable training of deep learning recommendation models

D Mudigere, Y Hao, J Huang, Z Jia, A Tulloch… - Proceedings of the 49th …, 2022 - dl.acm.org
Deep learning recommendation models (DLRMs) have been used across many business-
critical services at Meta and are the single largest AI application in terms of infrastructure …

On optimizing the communication of model parallelism

Y Zhuang, L Zheng, Z Li, E Xing, Q Ho… - Proceedings of …, 2023 - proceedings.mlsys.org
We study a novel and important communication pattern in large-scale model-parallel deep
learning (DL), which we call cross-mesh resharding. This pattern emerges when the two …

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

F Liang, Z Zhang, H Lu, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid growth in the volume of data sets, models, and devices in the domain of deep
learning, there is increasing attention on large-scale distributed deep learning. In contrast to …

Themis: A network bandwidth-aware collective scheduling policy for distributed training of DL models

S Rashidi, W Won, S Srinivasan, S Sridharan… - Proceedings of the 49th …, 2022 - dl.acm.org
Distributed training is a solution to reduce DNN training time by splitting the task across
multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead …

Robust searching-based gradient collaborative management in intelligent transportation system

H Shi, H Wang, R Ma, Y Hua, T Song, H Gao… - ACM Transactions on …, 2023 - dl.acm.org
With the rapid development of big data and the Internet of Things (IoT), traffic data from an
Intelligent Transportation System (ITS) is becoming more and more accessible. To …

MSCCLang: Microsoft collective communication language

M Cowan, S Maleki, M Musuvathi, O Saarikivi… - Proceedings of the 28th …, 2023 - dl.acm.org
Machine learning models with millions or billions of parameters are increasingly trained and
served on large multi-GPU systems. As models grow in size and execute on more GPUs …

Swing: Short-cutting Rings for Higher Bandwidth Allreduce

D De Sensi, T Bonato, D Saam, T Hoefler - 21st USENIX Symposium on …, 2024 - usenix.org
The allreduce collective operation accounts for a significant fraction of the runtime of
workloads running on distributed systems. One factor determining its performance is the …
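
As general background for this allreduce entry, the following is a minimal sketch of the cost model of the standard ring allreduce, not the Swing algorithm the cited paper proposes; the function name and the parameters (node count p, message size n, link bandwidth, per-step latency alpha) are illustrative assumptions.

    def ring_allreduce_time(n: float, p: int, bandwidth: float, alpha: float) -> float:
        """Rough runtime estimate (seconds) of a standard ring allreduce.

        The ring algorithm runs 2*(p-1) steps (a reduce-scatter followed by
        an allgather), each moving n/p bytes per link, so the bandwidth term
        approaches 2*n/bandwidth for large p while the latency term grows
        linearly with p.
        """
        steps = 2 * (p - 1)
        return steps * (alpha + (n / p) / bandwidth)

    # Example: 1 GiB of gradients, 8 nodes, 25 GB/s links, 5 us per step.
    print(ring_allreduce_time(n=2**30, p=8, bandwidth=25e9, alpha=5e-6))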

Synthesizing optimal parallelism placement and reduction strategies on hierarchical systems for deep learning

N Xie, T Norman, D Grewe… - Proceedings of Machine …, 2022 - proceedings.mlsys.org
We present a novel characterization of the mapping of multiple parallelism forms (e.g., data
and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and …

xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning

A Weingram, Y Li, H Qi, D Ng, L Dai, X Lu - Journal of Computer Science …, 2023 - Springer
Machine learning techniques have become ubiquitous in both industry and academic
applications. Increasing model sizes and training data volumes necessitate fast …

OSDP: Optimal sharded data parallel for distributed deep learning

Y Jiang, F Fu, X Miao, X Nie, B Cui - arXiv preprint arXiv:2209.13258, 2022 - arxiv.org
Large-scale deep learning models contribute to significant performance improvements on
a variety of downstream tasks. Current data and model parallelism approaches utilize model …