Communication-efficient distributed deep learning: A comprehensive survey

Z Tang, S Shi, W Wang, B Li, X Chu - arXiv preprint arXiv:2003.06307, 2020 - arxiv.org
Distributed deep learning (DL) has become prevalent in recent years to reduce training time
by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and …
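As illustrative context for the communication cost this survey targets (the example below is not drawn from the survey itself), synchronous data parallelism averages gradients across workers every iteration; a minimal sketch with `torch.distributed`:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module, world_size: int) -> None:
    """Average gradients across all data-parallel workers.

    Illustrative only: production systems fuse tensors into buckets and
    overlap this communication with backward computation, which is exactly
    the cost that communication-efficient techniques try to reduce.
    """
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```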

Pytorch distributed: Experiences on accelerating data parallel training

S Li, Y Zhao, R Varma, O Salpekar, P Noordhuis… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents the design, implementation, and evaluation of the PyTorch distributed
data parallel module. PyTorch is a widely adopted scientific computing package used in …
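The module the paper evaluates is exposed as `torch.nn.parallel.DistributedDataParallel`; the sketch below shows minimal usage with a placeholder model and random data (the training-loop details are assumptions, not taken from the paper):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    # Placeholder model; DDP broadcasts parameters at construction and
    # overlaps gradient all-reduce with the backward pass via bucketing.
    model = torch.nn.Linear(1024, 10).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    inputs = torch.randn(32, 1024, device=device)
    targets = torch.randint(0, 10, (32,), device=device)

    loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
    loss.backward()          # gradients are all-reduced across workers here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
```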

A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters

Y Jiang, Y Zhu, C Lan, B Yi, Y Cui, C Guo - 14th USENIX Symposium on …, 2020 - usenix.org
Data center clusters that run DNN training jobs are inherently heterogeneous. They have
GPUs and CPUs for computation and network bandwidth for distributed training. However …

DAPPLE: A pipelined data parallel approach for training large models

S Fan, Y Rong, C Meng, Z Cao, S Wang… - Proceedings of the 26th …, 2021 - dl.acm.org
It is a challenging task to train large DNN models on sophisticated GPU platforms with
diversified interconnect capabilities. Recently, pipelined training has been proposed as an …
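As a toy illustration of the micro-batch pipelining idea (this is not DAPPLE's scheduler; stage placement, the 1F1B-style schedule, and the hybrid data-parallel dimension are all omitted):

```python
import torch

def pipelined_forward(stages, minibatch, num_microbatches):
    """Toy micro-batch pipelining: split the mini-batch and stream the
    chunks through the stage list. A real pipeline scheduler places stages
    on different devices and interleaves forward/backward passes to keep
    all of them busy; this loop only shows the data flow.
    """
    outputs = []
    for micro in torch.chunk(minibatch, num_microbatches, dim=0):
        activation = micro
        for stage in stages:            # each stage would live on its own GPU
            activation = stage(activation)
        outputs.append(activation)
    return torch.cat(outputs, dim=0)

# Hypothetical two-stage split of a small model.
stages = [torch.nn.Linear(1024, 512), torch.nn.Linear(512, 10)]
out = pipelined_forward(stages, torch.randn(64, 1024), num_microbatches=4)
```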

A generic communication scheduler for distributed DNN training acceleration

Y Peng, Y Zhu, Y Chen, Y Bao, B Yi, C Lan… - Proceedings of the 27th …, 2019 - dl.acm.org
We present ByteScheduler, a generic communication scheduler for distributed DNN training
acceleration. ByteScheduler is based on our principled analysis that partitioning and …
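A rough sketch of the two ingredients the snippet alludes to, tensor partitioning and priority-ordered communication (the function below is hypothetical and not ByteScheduler's API):

```python
import heapq
import torch

def partition_and_schedule(named_grads, chunk_numel=262144):
    """Illustrative only: partition each gradient tensor into fixed-size
    chunks, then emit chunks of earlier layers first so their communication
    can finish before the next iteration's forward pass needs them.
    """
    heap = []
    for priority, (name, grad) in enumerate(named_grads):
        flat = grad.flatten()
        for i, chunk in enumerate(torch.split(flat, chunk_numel)):
            # Lower priority value = layer closer to the input = send sooner.
            heapq.heappush(heap, (priority, i, name, chunk))
    while heap:
        priority, idx, name, chunk = heapq.heappop(heap)
        yield name, idx, chunk   # hand off to the communication backend

# Hypothetical usage with per-layer gradients listed input-to-output:
grads = [("layer0.weight", torch.randn(1024, 1024)),
         ("layer1.weight", torch.randn(1024, 10))]
for name, idx, chunk in partition_and_schedule(grads):
    pass  # e.g. enqueue an all-reduce / push-pull for this chunk
```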

MegaScale: Scaling large language model training to more than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - … USENIX Symposium on …, 2024 - usenix.org
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …

ATP: In-network aggregation for multi-tenant learning

CL Lao, Y Le, K Mahajan, Y Chen, W Wu… - … USENIX Symposium on …, 2021 - usenix.org
Distributed deep neural network training (DT) systems are widely deployed in clusters where
the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes …

HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism

JH Park, G Yun, MY Chang, NT Nguyen, S Lee… - 2020 USENIX Annual …, 2020 - usenix.org
Deep Neural Network (DNN) models have continuously been growing in size in order to
improve the accuracy and quality of the models. Moreover, for training of large DNN models …

Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R Gu, K Zhang, Z Xu, Y Che, B Fan… - 2022 IEEE 38th …, 2022 - ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …

SiP-ML: high-bandwidth optical network interconnects for machine learning training

M Khani, M Ghobadi, M Alizadeh, Z Zhu… - Proceedings of the …, 2021 - dl.acm.org
This paper proposes optical network interconnects as a key enabler for building high-
bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML …