A linear speedup analysis of distributed deep learning with sparse and quantized communication

P Jiang, G Agrawal - Advances in Neural Information …, 2018 - proceedings.neurips.cc
The large communication overhead has imposed a bottleneck on the performance of
distributed Stochastic Gradient Descent (SGD) for training deep neural networks. Previous …

Understanding top-k sparsification in distributed deep learning

S Shi, X Chu, KC Cheung, S See - arXiv preprint arXiv:1911.08772, 2019 - arxiv.org
Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training
large-scale deep learning models, while the communication overhead among workers …
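For readers unfamiliar with the technique named in the title, the following is a minimal NumPy sketch of generic top-k gradient sparsification: each worker transmits only the k largest-magnitude gradient entries instead of the dense tensor. It is an illustration of the general idea, not the authors' implementation, and the function and variable names are assumptions made here for clarity.

    import numpy as np

    def topk_sparsify(grad, k):
        """Keep the k largest-magnitude entries of a gradient; zero the rest.

        Returns the sparse gradient plus the (indices, values) pair a worker
        would transmit instead of the dense tensor.
        """
        flat = grad.ravel()
        # Indices of the k entries with the largest absolute value.
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        values = flat[idx]
        sparse = np.zeros_like(flat)
        sparse[idx] = values
        return sparse.reshape(grad.shape), (idx, values)

    # Example: compress a 10^6-element gradient down to 0.1% of its entries.
    grad = np.random.randn(1_000_000)
    sparse_grad, (idx, vals) = topk_sparsify(grad, k=1000)
    print(f"sent {idx.size} of {grad.size} entries")

Practical systems usually pair this selection with error feedback, accumulating the discarded entries locally so that small gradient components are not permanently lost.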

Trading redundancy for communication: Speeding up distributed SGD for non-convex optimization

F Haddadpour, MM Kamani… - International …, 2019 - proceedings.mlr.press
Communication overhead is one of the key challenges that hinder the scalability of
distributed optimization algorithms for training large neural networks. In recent years, there has …

Communication-efficient distributed deep learning: A comprehensive survey

Z Tang, S Shi, W Wang, B Li, X Chu - arXiv preprint arXiv:2003.06307, 2020 - arxiv.org
Distributed deep learning (DL) has become prevalent in recent years to reduce training time
by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and …

Communication-efficient distributed deep learning with merged gradient sparsification on GPUs

S Shi, Q Wang, X Chu, B Li, Y Qin… - IEEE INFOCOM 2020 …, 2020 - ieeexplore.ieee.org
Distributed synchronous stochastic gradient descent (SGD) algorithms are widely used in
large-scale deep learning applications, while it is known that the communication bottleneck …
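As a rough illustration of the merging idea in the title, the sketch below concatenates several layers' gradients into one buffer before selecting the top-k entries, then splits the sparse result back into per-layer shapes. The paper's actual GPU pipelining and scheduling are not modeled, and merged_topk is a hypothetical helper name.

    import numpy as np

    def merged_topk(layer_grads, k):
        """Concatenate per-layer gradients, select top-k over the merged
        buffer, and split the sparse result back into per-layer shapes."""
        shapes = [g.shape for g in layer_grads]
        sizes = [g.size for g in layer_grads]
        merged = np.concatenate([g.ravel() for g in layer_grads])
        idx = np.argpartition(np.abs(merged), -k)[-k:]
        sparse = np.zeros_like(merged)
        sparse[idx] = merged[idx]
        splits = np.split(sparse, np.cumsum(sizes)[:-1])
        return [s.reshape(shp) for s, shp in zip(splits, shapes)]

    # Two layers merged into a single sparsified message.
    grads = [np.random.randn(3, 4), np.random.randn(10)]
    sparse_grads = merged_topk(grads, k=5)

Merging amortizes per-message overheads and lets top-k selection compare entries across layers rather than within each layer separately.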

Double quantization for communication-efficient distributed optimization

Y Yu, J Wu, L Huang - Advances in Neural Information …, 2019 - proceedings.neurips.cc
Modern distributed training of machine learning models often suffers from high
communication overhead for synchronizing stochastic gradients and model parameters. In …
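To illustrate the general idea of quantizing both communicated quantities (worker-to-server gradients and server-to-worker model parameters), here is a sketch using a plain unbiased uniform quantizer. The schemes proposed in the paper differ in their quantizer design and analysis; this is only a generic stand-in.

    import numpy as np

    def uniform_quantize(x, bits=8):
        """Uniform stochastic quantization of a vector to `bits` bits per entry."""
        scale = np.max(np.abs(x)) + 1e-12
        levels = 2 ** (bits - 1) - 1
        scaled = x / scale * levels
        low = np.floor(scaled)
        # Stochastic rounding keeps the quantizer unbiased in expectation.
        q = low + (np.random.rand(*x.shape) < (scaled - low))
        return q * scale / levels

    # Worker -> server: quantized gradient.  Server -> worker: quantized model.
    grad = np.random.randn(4096)
    params = np.random.randn(4096)
    g_hat = uniform_quantize(grad, bits=4)
    w_hat = uniform_quantize(params, bits=4)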

Error compensated quantized SGD and its applications to large-scale distributed optimization

J Wu, W Huang, J Huang… - … Conference on Machine …, 2018 - proceedings.mlr.press
Large-scale distributed optimization is of great importance in various applications. For
data-parallel distributed learning, the inter-node gradient communication often becomes …
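The sketch below shows the generic error-feedback pattern that the title refers to: the quantization error of each step is stored locally and added back to the next gradient before compression, so that compression error is compensated over time. The 1-bit sign quantizer and the class structure are illustrative assumptions, not the paper's exact algorithm.

    import numpy as np

    def sign_quantize(v):
        """1-bit sign quantizer with a single magnitude scale (generic choice)."""
        return np.sign(v) * np.mean(np.abs(v))

    class ErrorCompensatedWorker:
        """Error-feedback wrapper: quantization error is carried over to the
        next step instead of being discarded."""
        def __init__(self, dim):
            self.residual = np.zeros(dim)

        def compress(self, grad):
            corrected = grad + self.residual   # add the carried-over error
            q = sign_quantize(corrected)       # what is actually transmitted
            self.residual = corrected - q      # remember what was lost
            return q

    # One simulated compression step for a single worker.
    worker = ErrorCompensatedWorker(dim=8)
    sent = worker.compress(np.random.randn(8))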

Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations

D Basu, D Data, C Karakus… - Advances in Neural …, 2019 - proceedings.neurips.cc
The communication bottleneck has been identified as a significant issue in distributed
optimization of large-scale learning models. Recently, several approaches to mitigate this …
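As a single-worker sketch of the recipe named in the title (local SGD steps followed by a sparsified and quantized update), the code below composes top-k selection with a sign quantizer after several local steps. The actual Qsparse-local-SGD algorithm additionally uses error compensation and synchronizes the compressed updates across workers; the helper names here are assumptions.

    import numpy as np

    def topk(v, k):
        out = np.zeros_like(v)
        idx = np.argpartition(np.abs(v), -k)[-k:]
        out[idx] = v[idx]
        return out

    def quantize_sign(v):
        nonzero = v[v != 0]
        if nonzero.size == 0:
            return v
        return np.sign(v) * np.mean(np.abs(nonzero))

    def local_sgd_round(w, grads, lr=0.1, k=2):
        """One communication round: several local SGD steps, then a sparsified
        and quantized model update is 'sent' (here simply returned)."""
        w_local = w.copy()
        for g in grads:                       # H local steps, no communication
            w_local -= lr * g
        delta = w_local - w                   # accumulated local progress
        return quantize_sign(topk(delta, k))  # compress what is communicated

    w = np.zeros(6)
    local_grads = [np.random.randn(6) for _ in range(4)]   # H = 4 local steps
    update = local_sgd_round(w, local_grads)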

Compressed communication for distributed deep learning: Survey and quantitative evaluation

H Xu, CY Ho, AM Abdelmoniem, A Dutta, EH Bergou… - 2020 - repository.kaust.edu.sa
Powerful computer clusters are used nowadays to train complex deep neural networks
(DNNs) on large datasets. Distributed training workloads increasingly become …

Near-optimal sparse allreduce for distributed deep learning

S Li, T Hoefler - Proceedings of the 27th ACM SIGPLAN Symposium on …, 2022 - dl.acm.org
Communication overhead is one of the major obstacles to training large deep learning models
at scale. Gradient sparsification is a promising technique to reduce the communication …
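To make the aggregation step concrete, the following sketch sums per-worker (index, value) pairs into a dense buffer, which is the result a sparse allreduce must produce. The cited work is about achieving this reduction with near-optimal communication and handling index overlap efficiently, rather than the naive serial accumulation shown here.

    import numpy as np

    def sparse_allreduce(contributions, dim):
        """Sum per-worker sparse gradients given as (indices, values) pairs."""
        total = np.zeros(dim)
        for idx, vals in contributions:
            np.add.at(total, idx, vals)   # handles overlapping indices correctly
        return total

    # Three simulated workers, each sending its top-2 entries of a length-8 gradient.
    dim, k = 8, 2
    contribs = []
    for _ in range(3):
        g = np.random.randn(dim)
        idx = np.argpartition(np.abs(g), -k)[-k:]
        contribs.append((idx, g[idx]))
    reduced = sparse_allreduce(contribs, dim)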