Distributed training of deep learning models: A taxonomic perspective

M Langer, Z He, W Rahayu… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Distributed deep learning systems (DDLS) train deep neural network models by utilizing the
distributed resources of a cluster. Developers of DDLS are required to make many decisions …

Performance modeling and scalability optimization of distributed deep learning systems

F Yan, O Ruwase, Y He, T Chilimbi - Proceedings of the 21th ACM …, 2015 - dl.acm.org
Big deep neural network (DNN) models trained on large amounts of data have recently
achieved the best accuracy on hard tasks, such as image and speech recognition. Training …

Oneflow: Redesign the distributed deep learning framework from scratch

J Yuan, X Li, C Cheng, J Liu, R Guo, S Cai… - arXiv preprint arXiv …, 2021 - arxiv.org
Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface
for expressing and training a deep neural network (DNN) model on a single device or using …

Prediction of the resource consumption of distributed deep learning systems

G Yang, C Shin, J Lee, Y Yoo, C Yoo - … of the ACM on Measurement and …, 2022 - dl.acm.org
The prediction of the resource consumption for the distributed training of deep learning
models is of paramount importance, as it can inform a priori users how long their training …

Model accuracy and runtime tradeoff in distributed deep learning: A systematic study

S Gupta, W Zhang, F Wang - 2016 IEEE 16th International …, 2016 - ieeexplore.ieee.org
Deep learning with a large number of parameters requires distributed training, where model
accuracy and runtime are two important factors to be considered. However, there has been …

Consensus control for decentralized deep learning

L Kong, T Lin, A Koloskova… - … on Machine Learning, 2021 - proceedings.mlr.press
Decentralized training of deep learning models enables on-device learning over networks,
as well as efficient scaling to large compute clusters. Experiments in earlier works reveal …

Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-art
results in various domains, such as image recognition and natural language processing …

A survey on distributed machine learning

J Verbraeken, M Wolting, J Katzy… - ACM Computing Surveys …, 2020 - dl.acm.org
The demand for artificial intelligence has grown significantly over the past decade, and this
growth has been fueled by advances in machine learning techniques and the ability to …

Efficient decentralized deep learning by dynamic model averaging

M Kamp, L Adilova, J Sicking, F Hüger… - Machine Learning and …, 2019 - Springer
We propose an efficient protocol for decentralized training of deep neural networks from
distributed data sources. The proposed protocol allows to handle different phases of model …

Ako: Decentralised deep learning with partial gradient exchange

P Watcharapichat, VL Morales, RC Fernandez… - Proceedings of the …, 2016 - dl.acm.org
Distributed systems for the training of deep neural networks (DNNs) with large amounts of
data have vastly improved the accuracy of machine learning models for image and speech …