GShard: Scaling giant models with conditional computation and automatic sharding

D Lepikhin, HJ Lee, Y Xu, D Chen, O Firat… - arXiv preprint arXiv …, 2020 - arxiv.org
Neural network scaling has been critical for improving the model quality in many real-world
machine learning applications with vast amounts of training data and compute. Although this …
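
The "conditional computation" in the title refers to sparsely activated mixture-of-experts layers, where a gate routes each token to a small subset of experts. Below is a rough, minimal sketch of a top-2-gated MoE layer in PyTorch with arbitrary sizes; GShard's actual contribution, sharding such experts across accelerators via lightweight annotations, is not reproduced here.

# Minimal sketch of top-2 gated mixture-of-experts (conditional computation).
# Hypothetical sizes; dense routing math only, not GShard's sharded version.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)])

    def forward(self, x):                                # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        top_w, top_idx = scores.topk(2, dim=-1)          # top-2 experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize the two weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(2):
                mask = top_idx[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = Top2MoE()
print(moe(torch.randn(8, 64)).shape)                     # -> torch.Size([8, 64])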

PyTorch distributed: Experiences on accelerating data parallel training

S Li, Y Zhao, R Varma, O Salpekar, P Noordhuis… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents the design, implementation, and evaluation of the PyTorch distributed
data parallel module. PyTorch is a widely adopted scientific computing package used in …
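
The module described here is torch.nn.parallel.DistributedDataParallel. A minimal usage sketch, assuming one GPU per process, the NCCL backend, and a torchrun launch (the model and data below are placeholders):

# Minimal DistributedDataParallel sketch (launch with torchrun --nproc_per_node=N).
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # reads RANK/WORLD_SIZE from env
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = nn.Linear(128, 10).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])      # gradients all-reduced per bucket
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 128, device=rank)
        y = torch.randint(0, 10, (32,), device=rank)
        loss = nn.functional.cross_entropy(ddp_model(x), y)
        opt.zero_grad()
        loss.backward()                            # overlaps all-reduce with backward
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()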

ATP: In-network aggregation for multi-tenant learning

CL Lao, Y Le, K Mahajan, Y Chen, W Wu… - … USENIX Symposium on …, 2021 - usenix.org
Distributed deep neural network training (DT) systems are widely deployed in clusters where
the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes …
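
As a host-side toy illustration of the in-network aggregation idea only (not ATP's switch protocol or its multi-tenant handling): workers push gradient chunks that are summed on-path rather than at end hosts, typically in fixed point because programmable switches lack floating-point units. The scaling factor and chunk size below are arbitrary.

# Toy simulation of in-network gradient aggregation (not ATP's actual design).
# Workers quantize float gradients to fixed-point integers, an aggregator sums
# them chunk by chunk, and every worker receives the de-quantized sum.
import numpy as np

SCALE = 2 ** 16                       # illustrative fixed-point scaling factor
CHUNK = 4                             # elements per "packet"

def quantize(grad):
    return np.round(grad * SCALE).astype(np.int64)

def dequantize(agg):
    return agg.astype(np.float64) / SCALE

def aggregate(worker_grads):
    """Stand-in for the switch: sum integer chunks as they 'arrive'."""
    n = worker_grads[0].size
    out = np.zeros(n, dtype=np.int64)
    for start in range(0, n, CHUNK):
        for g in worker_grads:
            out[start:start + CHUNK] += quantize(g[start:start + CHUNK])
    return out

grads = [np.random.randn(8) for _ in range(3)]        # 3 workers
result = dequantize(aggregate(grads))
print(np.allclose(result, sum(grads), atol=1e-3))     # True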

HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism

JH Park, G Yun, MY Chang, NT Nguyen, S Lee… - 2020 USENIX Annual …, 2020 - usenix.org
Deep Neural Network (DNN) models have continuously been growing in size in order to
improve the accuracy and quality of the models. Moreover, for training of large DNN models …
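
A toy schedule illustrating the pipelined model parallelism side of this combination (GPipe-style forward pass only; HetPipe's wave-synchronous parallelism, heterogeneity handling, and the data-parallel dimension are not shown):

# Toy pipeline-parallel schedule: micro-batches keep several pipeline stages busy.
STAGES = 4          # model split over 4 devices
MICROBATCHES = 6    # one minibatch split into 6 micro-batches

def schedule(stages, microbatches):
    # At time step t, stage s works on micro-batch t - s (once it exists).
    steps = stages + microbatches - 1
    for t in range(steps):
        row = []
        for s in range(stages):
            m = t - s
            row.append(f"F{m}" if 0 <= m < microbatches else "--")
        print(f"t={t:<2} " + " ".join(f"{c:>3}" for c in row))

schedule(STAGES, MICROBATCHES)
# Bubbles ("--") appear only while the pipeline fills and drains; with more
# micro-batches per minibatch the idle fraction shrinks.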

SiP-ML: high-bandwidth optical network interconnects for machine learning training

M Khani, M Ghobadi, M Alizadeh, Z Zhu… - Proceedings of the …, 2021 - dl.acm.org
This paper proposes optical network interconnects as a key enabler for building high-
bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML …

Apollo: Automatic partition-based operator fusion through layer by layer optimization

J Zhao, X Gao, R Xia, Z Zhang… - Proceedings of …, 2022 - proceedings.mlsys.org
We study fusion for deep neural networks (DNNs) in a just-in-time (JIT) compilation
framework, Apollo. It considers both memory- and compute-bound tensor operators for fusion …
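
A small illustration of why fusing memory-bound elementwise operators pays off (generic JIT fusion via torch.compile in PyTorch 2.x, not Apollo's partition-based algorithm): the unfused chain makes three passes over memory, while a fusing compiler can emit a single kernel.

# Elementwise operator fusion, conceptually.
import torch

def unfused(x, bias):
    t = x + bias          # writes a full intermediate tensor
    t = torch.relu(t)     # reads and writes the full tensor again
    return t * 2.0        # and again: three passes over memory

def fused_fn(x, bias):
    return torch.relu(x + bias) * 2.0

# torch.compile asks the JIT compiler to fuse the chain of elementwise ops,
# ideally into one kernel with a single read and write per element.
fused = torch.compile(fused_fn)

x = torch.randn(1024, 1024)
bias = torch.randn(1024)
print(torch.allclose(unfused(x, bias), fused(x, bias)))   # True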

On optimizing the communication of model parallelism

Y Zhuang, L Zheng, Z Li, E Xing, Q Ho… - Proceedings of …, 2023 - proceedings.mlsys.org
We study a novel and important communication pattern in large-scale model-parallel deep
learning (DL), which we call cross-mesh resharding. This pattern emerges when the two …
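
A local NumPy illustration of the pattern (the paper is about scheduling the resulting transfers efficiently, which is not shown): a tensor row-sharded across a hypothetical 4-device mesh must be re-sharded column-wise across a 2-device mesh, producing one point-to-point message per (sender, receiver) pair.

# Local illustration of cross-mesh resharding. Each (sender, receiver) pair
# below corresponds to one transfer a real system would have to schedule.
import numpy as np

full = np.arange(8 * 6).reshape(8, 6)

send_shards = np.split(full, 4, axis=0)                  # mesh A: 4 devices, row-sharded
recv_shards = [np.empty((8, 3)) for _ in range(2)]       # mesh B: 2 devices, column-sharded

transfers = []
for s, shard in enumerate(send_shards):                  # sender s owns rows 2s..2s+1
    for r in range(2):                                   # receiver r owns columns 3r..3r+2
        piece = shard[:, r * 3:(r + 1) * 3]
        recv_shards[r][s * 2:(s + 1) * 2, :] = piece
        transfers.append((s, r, piece.shape))

print(transfers)                                         # 4 x 2 = 8 point-to-point messages
print(np.array_equal(np.concatenate(recv_shards, axis=1), full))   # True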

Accelerating distributed MoE training and inference with Lina

J Li, Y Jiang, Y Zhu, C Wang, H Xu - 2023 USENIX Annual Technical …, 2023 - usenix.org
Scaling model parameters improves model quality at the price of high computation
overhead. Sparsely activated models, usually in the form of Mixture of Experts (MoE) …
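
In expert-parallel MoE training, the dominant communication is the all-to-all exchange that ships each token to the rank hosting its selected expert; that exchange is what Lina targets. A minimal sketch of the exchange itself, assuming one expert per rank, the NCCL backend, and a torchrun launch (Lina's scheduling and prioritization are not shown):

# Sketch of the all-to-all token dispatch used in expert-parallel MoE layers.
import torch
import torch.distributed as dist

def dispatch_tokens(tokens, expert_ids, world_size):
    """Send each token to the rank that hosts its chosen expert."""
    order = torch.argsort(expert_ids)                    # group tokens by destination rank
    tokens = tokens[order]
    send_counts = torch.bincount(expert_ids, minlength=world_size)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)     # exchange counts first
    recv_buf = tokens.new_empty(int(recv_counts.sum()), tokens.shape[1])
    dist.all_to_all_single(recv_buf, tokens,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
    return recv_buf                                      # tokens for the local expert

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    world = dist.get_world_size()
    tokens = torch.randn(16, 32, device=rank)                    # 16 local tokens, d=32
    expert_ids = torch.randint(0, world, (16,), device=rank)     # top-1 routing decision
    received = dispatch_tokens(tokens, expert_ids, world)
    print(rank, received.shape)
    dist.destroy_process_group()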

Blink: Fast and generic collectives for distributed ML

G Wang, S Venkataraman… - Proceedings of …, 2020 - proceedings.mlsys.org
Model parameter synchronization across GPUs introduces high overheads for data-
parallel training at scale. Existing parameter synchronization protocols cannot effectively …
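
The synchronization being optimized here is essentially an all-reduce over gradient buffers. A self-contained toy ring all-reduce for reference (Blink itself constructs collectives from spanning trees matched to the measured link topology rather than using a ring):

# Toy ring all-reduce over in-memory "workers".
import numpy as np

def ring_allreduce(grads):
    """grads: one 1-D array per worker; returns each worker's synchronized copy."""
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Reduce-scatter: after n-1 steps, worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                     # chunk worker i forwards this step
            chunks[(i + 1) % n][c] += chunks[i][c]

    # All-gather: circulate the reduced chunks until every worker has all of them.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()

    return [np.concatenate(c) for c in chunks]

workers = [np.random.randn(10) for _ in range(4)]
synced = ring_allreduce(workers)
print(np.allclose(synced[0], sum(workers)))        # True: every copy equals the sum
print(all(np.allclose(s, synced[0]) for s in synced))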

Efficient sparse collective communication and its application to accelerate distributed deep learning

J Fei, CY Ho, AN Sahu, M Canini, A Sapio - Proceedings of the 2021 …, 2021 - dl.acm.org
Efficient collective communication is crucial to parallel-computing applications such as
distributed training of large-scale recommendation systems and natural language …
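
The underlying idea of sparse collective communication is to exchange only the significant gradient coordinates rather than dense buffers. A generic top-k sparsification sketch that makes the traffic/accuracy trade-off visible (the paper's actual aggregation protocol is not reproduced):

# Generic sketch of sparse gradient exchange: each worker contributes only its
# top-k coordinates as (index, value) pairs, which are scattered back into a
# dense buffer at aggregation time.
import numpy as np

def compress_topk(grad, k):
    idx = np.argpartition(np.abs(grad), -k)[-k:]       # indices of the k largest |g|
    return idx, grad[idx]

def decompress(idx, vals, size):
    dense = np.zeros(size)
    dense[idx] = vals
    return dense

def sparse_allreduce(worker_grads, k):
    size = worker_grads[0].size
    total = np.zeros(size)
    for g in worker_grads:
        idx, vals = compress_topk(g, k)                # only 2*k numbers "on the wire"
        total += decompress(idx, vals, size)
    return total

grads = [np.random.randn(1000) for _ in range(4)]
approx = sparse_allreduce(grads, k=100)                # ~10x less traffic per worker
exact = sum(grads)
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))   # relative error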