A quick survey on large scale distributed deep learning systems

Z Zhang, L Yin, Y Peng, D Li - 2018 IEEE 24th International …, 2018 - ieeexplore.ieee.org
Deep learning has been widely used in various fields and has played a major role. With its
gradual penetration into these fields, the data quantity of each application is …

An in-depth analysis of distributed training of deep neural networks

Y Ko, K Choi, J Seo, SW Kim - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org
As the popularity of deep learning in industry rapidly grows, efficient training of deep neural
networks (DNNs) becomes important. To train a DNN with a large amount of data, distributed …
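
As context for this entry, here is a minimal single-process sketch of the synchronous data-parallel pattern such analyses cover, with gradient averaging standing in for the allreduce step (worker count, model, and learning rate are illustrative assumptions, not taken from the paper):

    import numpy as np

    # Simulated synchronous data-parallel SGD: each worker computes a
    # gradient on its own data shard, gradients are averaged (the
    # allreduce step), and every replica applies the same update.
    rng = np.random.default_rng(0)
    num_workers, dim, lr = 4, 8, 0.1             # illustrative choices
    w = rng.normal(size=dim)                     # replicated model parameters
    X = rng.normal(size=(num_workers, 32, dim))  # one shard per worker
    y = X @ rng.normal(size=dim)                 # synthetic regression targets

    for step in range(100):
        # Each worker's local gradient of 0.5*||Xw - y||^2 on its shard.
        grads = [X[k].T @ (X[k] @ w - y[k]) / len(X[k]) for k in range(num_workers)]
        g = np.mean(grads, axis=0)               # allreduce: average across workers
        w -= lr * g                              # identical update on every replica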

HPDL: towards a general framework for high-performance distributed deep learning

D Li, Z Lai, K Ge, Y Zhang, Z Zhang… - 2019 IEEE 39th …, 2019 - ieeexplore.ieee.org
With the growing scale of data volume and neural network size, we have entered the era
of distributed deep learning. High-performance training and inference on distributed …

Distributed deep learning of ResNet50 and VGG16 with pipeline parallelism

N Takisawa, S Yazaki, H Ishihata - 2020 Eighth International …, 2020 - ieeexplore.ieee.org
Data-parallel distributed deep learning has been used to accelerate learning speed. As
computation is accelerated, communication is becoming a bottleneck. Therefore …
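
For context, a toy schedule illustrating the micro-batch pipelining the title refers to: a batch is split into micro-batches that flow through the stages in a staggered pattern, so stages compute concurrently instead of idling (stage and micro-batch counts are illustrative assumptions, not from the paper):

    # GPipe-style forward-pass schedule: at clock tick t, stage s works
    # on micro-batch t - s, so all stages are busy once the pipe fills.
    num_stages, num_micro = 3, 5                 # illustrative sizes

    for t in range(num_stages + num_micro - 1):  # forward-pass clock ticks
        busy = []
        for s in range(num_stages):
            m = t - s                            # micro-batch at stage s, tick t
            if 0 <= m < num_micro:
                busy.append(f"stage{s}:mb{m}")
        print(f"t={t}: " + "  ".join(busy))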

Model accuracy and runtime tradeoff in distributed deep learning: A systematic study

S Gupta, W Zhang, F Wang - 2016 IEEE 16th International …, 2016 - ieeexplore.ieee.org
Deep learning with a large number of parameters requires distributed training, where model
accuracy and runtime are two important factors to be considered. However, there has been …
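
For context, a toy single-process sketch of the tradeoff such studies examine, contrasting a synchronous update with a one-step-stale asynchronous one (the problem, the one-step staleness model, and the step sizes are illustrative assumptions, not from the paper):

    import numpy as np

    # Synchronous SGD uses the gradient at the current weights; async
    # SGD may apply a stale gradient computed from older weights, which
    # is faster in wall-clock terms but can slow or hurt convergence.
    rng = np.random.default_rng(1)
    dim, lr, steps = 8, 0.05, 200
    A = rng.normal(size=(64, dim)); b = A @ rng.normal(size=dim)
    grad = lambda w: A.T @ (A @ w - b) / len(A)

    w_sync = np.zeros(dim)
    for _ in range(steps):
        w_sync -= lr * grad(w_sync)          # gradient from current weights

    w_async, w_old = np.zeros(dim), np.zeros(dim)
    for _ in range(steps):
        g = grad(w_old)                      # stale gradient (one step behind)
        w_old = w_async.copy()
        w_async -= lr * g

    print("sync loss :", 0.5 * np.mean((A @ w_sync - b) ** 2))
    print("async loss:", 0.5 * np.mean((A @ w_async - b) ** 2))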

From distributed machine to distributed deep learning: a comprehensive survey

M Dehghani, Z Yazdanparast - Journal of Big Data, 2023 - Springer
Artificial intelligence has made remarkable progress in handling complex tasks, thanks to
advances in hardware acceleration and machine learning algorithms. However, to acquire …

Performance optimizations and analysis of distributed deep learning with approximated second-order optimization method

Y Tsuji, K Osawa, Y Ueno, A Naruse, R Yokota… - … Proceedings of the …, 2019 - dl.acm.org
Faster training of deep neural networks is desired to speed up the research and
development cycle in deep learning. Distributed deep learning and second-order …
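
For context, a minimal sketch contrasting a first-order step with a damped second-order step on a quadratic loss; the damping stands in for the regularization that approximated second-order methods require (the matrices and constants are illustrative assumptions, not from the paper):

    import numpy as np

    # Quadratic loss 0.5*w^T H w - g^T w with SPD curvature H. A damped
    # Newton step solves (H + eps*I) d = grad and jumps near the optimum
    # in one update, while a plain gradient step makes limited progress.
    rng = np.random.default_rng(2)
    dim, eps = 6, 1e-2
    Q = rng.normal(size=(dim, dim))
    H = Q @ Q.T + np.eye(dim)                # SPD curvature (Hessian)
    g = rng.normal(size=dim)
    w = np.zeros(dim)

    grad = H @ w - g                         # gradient at w
    w_sgd = w - 0.1 * grad                   # first-order update
    w_newton = w - np.linalg.solve(H + eps * np.eye(dim), grad)  # damped Newton

    loss = lambda v: 0.5 * v @ H @ v - g @ v
    print(loss(w_sgd), loss(w_newton))       # Newton lands near the optimum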

A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks

Y Li, J Park, M Alian, Y Yuan, Z Qu… - 2018 51st Annual …, 2018 - ieeexplore.ieee.org
Training real-world Deep Neural Networks (DNNs) can take an eon (i.e., weeks or months)
without leveraging distributed systems. Even distributed training takes inordinate time, of …

An allreduce algorithm and network co-design for large-scale training of distributed deep learning

TT Nguyen, M Wahib - 2021 IEEE/ACM 21st International …, 2021 - ieeexplore.ieee.org
Distributed training of Deep Neural Networks (DNNs) on High-Performance Computing
(HPC) systems is becoming increasingly common. HPC systems dedicated entirely or …
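
For context, a single-process simulation of the classic ring allreduce that such co-designs build on: a reduce-scatter phase circulates and sums chunks around the ring, then an allgather phase circulates the finished sums so every worker ends with the total (ring size and chunk length are illustrative assumptions, not from the paper):

    import numpy as np

    # Each of the p workers holds a gradient vector split into p chunks;
    # every step, worker k sends one chunk to worker (k+1) % p.
    p, chunk = 4, 3                          # illustrative ring size / chunk length
    rng = np.random.default_rng(3)
    data = [rng.integers(0, 10, size=(p, chunk)).astype(float) for _ in range(p)]
    expected = sum(data)                     # what every worker should end with

    # Reduce-scatter: after p-1 steps, worker k owns the full sum of
    # chunk (k+1) % p.
    for step in range(p - 1):
        sends = [data[k][(k - step) % p].copy() for k in range(p)]
        for k in range(p):                   # worker k receives from worker k-1
            data[k][(k - 1 - step) % p] += sends[(k - 1) % p]

    # Allgather: circulate each finished chunk around the ring.
    for step in range(p - 1):
        sends = [data[k][(k + 1 - step) % p].copy() for k in range(p)]
        for k in range(p):
            data[k][(k - step) % p] = sends[(k - 1) % p]

    assert all(np.allclose(d, expected) for d in data)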

Nexus: Bringing efficient and scalable training to deep learning frameworks

Y Wang, L Zhang, Y Ren… - 2017 IEEE 25th …, 2017 - ieeexplore.ieee.org
Demand is mounting in the industry for scalable GPU-based deep learning systems.
Unfortunately, existing training applications built atop popular deep learning frameworks …