PyTorch FSDP: experiences on scaling fully sharded data parallel

Y Zhao, A Gu, R Varma, L Luo, CC Huang, M Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
It is widely acknowledged that large models have the potential to deliver superior
performance across a broad range of domains. Despite the remarkable progress made in …
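FSDP ships in PyTorch as a module wrapper that shards parameters, gradients, and optimizer state across data-parallel ranks. A minimal usage sketch, assuming torch >= 1.12, one GPU per rank with the NCCL backend, and a launch via torchrun; the toy model and hyperparameters are placeholders:

```python
# Minimal FSDP sketch: assumes torch >= 1.12, one GPU per rank, NCCL backend,
# and a launch via torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy placeholder model; real workloads wrap transformer blocks.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # Wrapping shards parameters, gradients, and optimizer state across ranks;
    # full parameters are materialized (all-gathered) only around each unit's
    # forward and backward.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    model(x).sum().backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run with, for example, torchrun --nproc_per_node=8 fsdp_demo.py (file name hypothetical); the sketch leaves FSDP's sharding and communication-scheduling options at their defaults.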

On optimizing the communication of model parallelism

Y Zhuang, L Zheng, Z Li, E Xing, Q Ho… - Proceedings of …, 2023 - proceedings.mlsys.org
We study a novel and important communication pattern in large-scale model-parallel deep
learning (DL), which we call cross-mesh resharding. This pattern emerges when the two …
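To make the pattern concrete: when a tensor sharded over one device mesh must be consumed by a stage that shards it differently, each (sender, receiver) pair exchanges the block where their index ranges intersect. A toy sketch with hypothetical layouts (row-sharded over 2 devices on the sending mesh, column-sharded over 4 on the receiving mesh), not the paper's implementation:

```python
# Cross-mesh resharding reduced to index arithmetic (hypothetical layouts):
# the sending stage shards an (8, 8) tensor by rows over 2 devices, the
# receiving stage expects it sharded by columns over 4 devices.
R, C = 8, 8
src_rows = [(0, 4), (4, 8)]                  # row range owned by each sender
dst_cols = [(0, 2), (2, 4), (4, 6), (6, 8)]  # column range owned by each receiver

transfers = []
for s, (r0, r1) in enumerate(src_rows):
    for d, (c0, c1) in enumerate(dst_cols):
        # Sender s holds rows r0:r1 of every column, so it owns exactly the
        # block rows r0:r1 x cols c0:c1 of receiver d's shard.
        transfers.append((s, d, (r0, r1), (c0, c1)))

for s, d, rows, cols in transfers:
    print(f"sender {s} -> receiver {d}: rows {rows}, cols {cols}")
```

Even this tiny example yields 8 point-to-point transfers; scheduling and routing such transfers efficiently over the slower inter-mesh links is, roughly, the problem the paper targets.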

Horus: Interference-aware and prediction-based scheduling in deep learning systems

G Yeung, D Borowiec, R Yang, A Friday… - … on Parallel and …, 2021 - ieeexplore.ieee.org
To accelerate the training of Deep Learning (DL) models, clusters of machines equipped
with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of …

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

F Liang, Z Zhang, H Lu, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid growth in the volume of data sets, models, and devices in the domain of deep
learning, there is increasing attention on large-scale distributed deep learning. In contrast to …

DeAR: Accelerating distributed deep learning with fine-grained all-reduce pipelining

L Zhang, S Shi, X Chu, W Wang, B Li… - 2023 IEEE 43rd …, 2023 - ieeexplore.ieee.org
Communication scheduling has been shown to be effective in accelerating distributed
training, which enables all-reduce communications to be overlapped with backpropagation …
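The baseline being improved on is the overlap of gradient all-reduce with the remainder of the backward pass. A rough sketch of that overlap using per-parameter autograd hooks and async collectives; it is not DeAR's finer-grained reduce-scatter/all-gather pipeline, and it assumes a fresh backward pass (no gradient accumulation) and an already initialized process group:

```python
# Overlap gradient all-reduce with backpropagation via per-parameter hooks and
# async collectives. Generic overlap idea only, not DeAR's decomposition.
import torch
import torch.distributed as dist

def backward_with_overlap(loss, params):
    pending = []

    def hook(grad):
        # Fires as soon as this parameter's gradient is produced; the collective
        # runs on the communication stream while autograd keeps working on
        # earlier layers.
        pending.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True))
        return grad

    handles = [p.register_hook(hook) for p in params if p.requires_grad]
    loss.backward()
    for work in pending:          # wait for all in-flight reductions
        work.wait()
    for h in handles:
        h.remove()
    for p in params:              # average rather than sum
        if p.grad is not None:
            p.grad.div_(dist.get_world_size())

if __name__ == "__main__":
    # Single-process demo on CPU with the gloo backend.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)
    model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                                torch.nn.Linear(32, 1))
    loss = model(torch.randn(4, 16)).sum()
    backward_with_overlap(loss, list(model.parameters()))
    dist.destroy_process_group()
```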

DC2: Delay-aware compression control for distributed machine learning

AM Abdelmoniem, M Canini - IEEE INFOCOM 2021-IEEE …, 2021 - ieeexplore.ieee.org
Distributed training performs data-parallel training of DNN models which is a necessity for
increasingly complex models and large datasets. Recent works are identifying major …
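As an illustration of the compression being controlled, top-k gradient sparsification is the kind of compressor whose ratio a delay-aware controller could tune against measured network conditions; the controller itself is not reproduced here, and the fixed ratio below is a placeholder:

```python
# Top-k gradient sparsification sketch; `ratio` is a fixed placeholder rather
# than a value chosen by a delay-aware controller.
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    flat = grad.reshape(-1)
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)   # largest-magnitude entries
    return flat[indices], indices, grad.shape

def topk_decompress(values, indices, shape):
    flat = torch.zeros(shape.numel(), dtype=values.dtype, device=values.device)
    flat[indices] = values
    return flat.reshape(shape)

g = torch.randn(4, 256)
values, indices, shape = topk_compress(g, ratio=0.05)
g_hat = topk_decompress(values, indices, shape)
print(f"kept {values.numel()} of {g.numel()} entries")
```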

Task placement and resource allocation for edge machine learning: A GNN-based multi-agent reinforcement learning paradigm

Y Li, X Zhang, T Zeng, J Duan, C Wu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Machine learning (ML) tasks are one of the major workloads in today's edge computing
networks. Existing edge-cloud schedulers allocate the requested amounts of resources to …

Synthesizing optimal collective algorithms

Z Cai, Z Liu, S Maleki, M Musuvathi… - Proceedings of the 26th …, 2021 - dl.acm.org
Collective communication algorithms are an important component of distributed
computation. Indeed, in the case of deep-learning, collective communication is the Amdahl's …
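For reference, the object being synthesized is a schedule of chunked sends between ranks. The textbook ring all-reduce below, simulated on in-memory buffers (the classic baseline, not an SCCL-synthesized schedule), shows what such a schedule specifies:

```python
# Textbook ring all-reduce simulated on in-memory buffers: which chunk each
# rank sends to its neighbour at each step.
def ring_allreduce(buffers):
    n = len(buffers)                              # one buffer of n chunks per rank
    chunks = [[list(c) for c in b] for b in buffers]

    # Phase 1: reduce-scatter. After n-1 steps, rank r holds the full sum of
    # chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, chunks[r][(r - step) % n]) for r in range(n)]
        for src, c, data in msgs:
            dst = (src + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], data)]

    # Phase 2: all-gather. The completed chunks circulate around the ring.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n]) for r in range(n)]
        for src, c, data in msgs:
            chunks[(src + 1) % n][c] = list(data)
    return chunks

# 4 ranks, each contributing the value of its rank in every element.
result = ring_allreduce([[[r, r] for _ in range(4)] for r in range(4)])
assert all(chunk == [6, 6] for rank in result for chunk in rank)  # 0+1+2+3 = 6
```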

TapFinger: Task placement and fine-grained resource allocation for edge machine learning

Y Li, T Zeng, X Zhang, J Duan… - IEEE INFOCOM 2023 …, 2023 - ieeexplore.ieee.org
Machine learning (ML) tasks are one of the major workloads in today's edge computing
networks. Existing edge-cloud schedulers allocate the requested amounts of resources to …

MSCCLang: Microsoft collective communication language

M Cowan, S Maleki, M Musuvathi, O Saarikivi… - Proceedings of the 28th …, 2023 - dl.acm.org
Machine learning models with millions or billions of parameters are increasingly trained and
served on large multi-GPU systems. As models grow in size and execute on more GPUs …