Communication-efficient distributed deep learning: A comprehensive survey

Z Tang, S Shi, W Wang, B Li, X Chu - arXiv preprint arXiv:2003.06307, 2020 - arxiv.org
Distributed deep learning (DL) has become prevalent in recent years to reduce training time
by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and …
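
For a concrete taste of the communication-efficient methods such a survey covers, here is a minimal top-k gradient sparsification sketch in PyTorch. The function names and the 5% ratio are illustrative, not taken from the survey; the idea is that workers exchange only (index, value) pairs instead of dense gradients.

```python
import math
import torch

def topk_sparsify(grad: torch.Tensor, ratio: float = 0.05):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries.
    Workers then transmit (indices, values) instead of the dense gradient,
    cutting traffic by roughly a factor of 1/ratio."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]

def desparsify(indices, values, shape):
    """Rebuild a dense gradient from the transmitted (indices, values) pair."""
    flat = torch.zeros(math.prod(shape), dtype=values.dtype)
    flat[indices] = values
    return flat.reshape(shape)

g = torch.randn(4, 256)
idx, val = topk_sparsify(g)
g_hat = desparsify(idx, val, g.shape)
```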

Pytorch distributed: Experiences on accelerating data parallel training

S Li, Y Zhao, R Varma, O Salpekar, P Noordhuis… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents the design, implementation, and evaluation of the PyTorch distributed
data parallel module. PyTorch is a widely-adopted scientific computing package used in …
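
For orientation, a minimal usage sketch of the module the paper describes. The `torchrun` launch and the toy model are assumptions, but `DistributedDataParallel` and its bucketed all-reduce overlapped with the backward pass are the paper's actual subject.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Typically launched via `torchrun --nproc_per_node=N train.py`,
    # which sets RANK/WORLD_SIZE/LOCAL_RANK environment variables.
    dist.init_process_group(backend="gloo")  # "nccl" on GPU clusters
    model = nn.Linear(128, 10)
    ddp_model = DDP(model)  # gradients are all-reduced in buckets during backward
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()  # communication overlaps with backward computation
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()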

A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters

Y Jiang, Y Zhu, C Lan, B Yi, Y Cui, C Guo - 14th USENIX Symposium on …, 2020 - usenix.org
Data center clusters that run DNN training jobs are inherently heterogeneous. They have
GPUs and CPUs for computation and network bandwidth for distributed training. However …
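
A toy sketch of the hybrid idea, splitting gradient aggregation between spare CPU machines (parameter-server style) and GPU workers rather than committing to one strategy. The function name and the fixed `cpu_fraction` are mine; the real system derives the split from measured cluster bandwidth.

```python
import torch

def partition_for_hybrid_aggregation(grad: torch.Tensor, cpu_fraction: float):
    """Split a gradient tensor into a shard aggregated on CPU machines
    and a shard aggregated among GPU workers (illustrative only)."""
    flat = grad.flatten()
    split = int(flat.numel() * cpu_fraction)
    return flat[:split], flat[split:]

g = torch.randn(1024)
cpu_part, gpu_part = partition_for_hybrid_aggregation(g, cpu_fraction=0.4)
```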

On optimizing the communication of model parallelism

Y Zhuang, L Zheng, Z Li, E Xing, Q Ho… - Proceedings of …, 2023 - proceedings.mlsys.org
We study a novel and important communication pattern in large-scale model-parallel deep
learning (DL), which we call cross-mesh resharding. This pattern emerges when the two …
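
A toy sketch of what cross-mesh resharding has to compute: when a tensor sharded along one axis on a source mesh must arrive sharded along another axis on a destination mesh, each (source, destination) device pair exchanges one sub-block. The function name and the row/column layouts are illustrative assumptions.

```python
def cross_mesh_send_plan(tensor_shape, src_parts, dst_parts):
    """A tensor sharded along axis 0 over `src_parts` source devices must be
    resharded along axis 1 over `dst_parts` destination devices. Returns, per
    (src, dst) pair, the sub-block the source device must send."""
    rows, cols = tensor_shape
    row_step, col_step = rows // src_parts, cols // dst_parts
    plan = {}
    for s in range(src_parts):
        for d in range(dst_parts):
            plan[(s, d)] = (slice(s * row_step, (s + 1) * row_step),
                            slice(d * col_step, (d + 1) * col_step))
    return plan

plan = cross_mesh_send_plan((8, 8), src_parts=4, dst_parts=2)
# Each source device s sends the block plan[(s, d)] toward destination d.
```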

Accelerating distributed MoE training and inference with Lina

J Li, Y Jiang, Y Zhu, C Wang, H Xu - 2023 USENIX Annual Technical …, 2023 - usenix.org
Scaling model parameters improves model quality at the price of high computation
overhead. Sparsely activated models, usually in the form of Mixture of Experts (MoE) …
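
For orientation, a minimal top-k MoE gate in PyTorch. This is the standard sparsely-activated pattern whose per-token expert routing triggers the all-to-all exchanges that Lina targets; it is a sketch of the pattern, not Lina's code.

```python
import torch
import torch.nn as nn

class TopKGate(nn.Module):
    """Minimal top-k MoE gate: each token is routed to its k highest-scoring
    experts, so only a small fraction of parameters is active per token."""
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x):                      # x: [tokens, d_model]
        logits = self.w(x)                     # [tokens, num_experts]
        weights, experts = torch.topk(logits.softmax(dim=-1), self.k, dim=-1)
        return weights, experts                # per-token expert choices

gate = TopKGate(d_model=64, num_experts=8, k=2)
w, e = gate(torch.randn(16, 64))
```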

Efficient sparse collective communication and its application to accelerate distributed deep learning

J Fei, CY Ho, AN Sahu, M Canini, A Sapio - Proceedings of the 2021 …, 2021 - dl.acm.org
Efficient collective communication is crucial to parallel-computing applications such as
distributed training of large-scale recommendation systems and natural language …
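
A single-process simulation of the core idea behind sparse collectives, namely skipping all-zero blocks on the wire; the block size and function names are illustrative assumptions, not the paper's implementation.

```python
import torch

def sparse_blocks(t: torch.Tensor, block: int = 64):
    """Split a gradient into fixed-size blocks and keep only nonzero ones,
    so zero blocks never need to be transmitted."""
    flat = t.flatten()
    pad = (-flat.numel()) % block
    flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.reshape(-1, block)
    nz = blocks.abs().sum(dim=1) > 0
    return nz.nonzero().flatten(), blocks[nz]

def sparse_allreduce_sim(worker_grads, block: int = 64):
    """Simulated sparse all-reduce: sum per-block contributions, touching
    only blocks that some worker actually sent."""
    total = {}
    for g in worker_grads:
        ids, blks = sparse_blocks(g, block)
        for i, b in zip(ids.tolist(), blks):
            total[i] = total.get(i, 0) + b
    return total  # block-id -> reduced block

grads = [torch.zeros(256).index_fill_(0, torch.tensor([3, 90]), 1.0)
         for _ in range(4)]
reduced = sparse_allreduce_sim(grads)
```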

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

S Rajasekaran, M Ghobadi, A Akella - 21st USENIX Symposium on …, 2024 - usenix.org
We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters.
CASSINI introduces a novel geometric abstraction to consider the communication pattern of …
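
A toy rendering of the geometric intuition (not CASSINI's algorithm): treat each job's communication as a periodic on/off signal on a shared link and search for a relative time shift that interleaves the bursts. All units and names below are illustrative.

```python
def overlap(shift_a, job_a, job_b, period):
    """How much do two jobs' communication bursts collide on a shared link,
    given job A's relative time shift? Jobs communicate for job_a (resp.
    job_b) of every `period` toy time units."""
    on_a = {(t + shift_a) % period for t in range(job_a)}
    on_b = set(range(job_b))
    return len(on_a & on_b)

def best_shift(job_a, job_b, period):
    """Pick the shift that minimizes burst collisions."""
    return min(range(period), key=lambda s: overlap(s, job_a, job_b, period))

# Two jobs that each communicate for 3 of every 10 time units:
print(best_shift(3, 3, 10))  # a shift of 3 fully interleaves the bursts
```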

Transparent GPU sharing in container clouds for deep learning workloads

B Wu, Z Zhang, Z Bai, X Liu, X Jin - 20th USENIX Symposium on …, 2023 - usenix.org
Containers are widely used for resource management in datacenters. A common practice to
support deep learning (DL) training in container clouds is to statically bind GPUs to …

KungFu: Making training in distributed machine learning adaptive

L Mai, G Li, M Wagenländer, K Fertakis… - … USENIX Symposium on …, 2020 - usenix.org
When using distributed machine learning (ML) systems to train models on a cluster of worker
machines, users must configure a large number of parameters: hyper-parameters (e.g., the …
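
A toy adaptation policy in the paper's spirit, monitoring a run-time signal and adjusting a configuration parameter mid-training instead of fixing it up front. This is illustrative only and not KungFu's actual API.

```python
class AdaptBatchSizePolicy:
    """Toy policy: grow the batch size when the measured gradient noise
    scale suggests larger batches would be statistically efficient."""
    def __init__(self, batch_size: int, max_batch_size: int = 4096):
        self.batch_size = batch_size
        self.max_batch_size = max_batch_size

    def after_step(self, grad_noise_scale: float) -> int:
        # Heuristic: noisy gradients relative to the current batch size
        # indicate headroom for a larger batch.
        if grad_noise_scale > self.batch_size and self.batch_size < self.max_batch_size:
            self.batch_size = min(self.batch_size * 2, self.max_batch_size)
        return self.batch_size

policy = AdaptBatchSizePolicy(batch_size=128)
for noise in [90.0, 300.0, 700.0]:
    bs = policy.after_step(noise)  # doubles once noise exceeds the batch size
```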

Towards scalable distributed training of deep learning on public cloud clusters

S Shi, X Zhou, S Song, X Wang, Z Zhu… - Proceedings of …, 2021 - proceedings.mlsys.org
Distributed training techniques have been widely deployed in large-scale deep model
training on dense-GPU clusters. However, on public cloud clusters, due to the moderate …
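
One common way to attack moderate inter-node bandwidth is compressed all-reduce; below is a half-precision sketch using `torch.distributed`. It assumes an already-initialized process group and is not necessarily the paper's exact scheme.

```python
import torch
import torch.distributed as dist

def compressed_allreduce(grad: torch.Tensor) -> torch.Tensor:
    """Halve network traffic by all-reducing gradients in float16, then
    average back in the original precision (illustrative sketch)."""
    buf = grad.to(torch.float16)          # compress before hitting the network
    dist.all_reduce(buf, op=dist.ReduceOp.SUM)
    return buf.to(grad.dtype) / dist.get_world_size()
```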