Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had an immense success in the recent past, leading to
state-of-the-art results in various domains, such as image recognition and natural language processing …

A comparative measurement study of deep learning as a service framework

Y Wu, L Liu, C Pu, W Cao, S Sahin… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Big data powered Deep Learning (DL) and its applications have blossomed in recent years,
fueled by three technological trends: a large amount of digitized data openly accessible, a …

A survey of deep learning on mobile devices: Applications, optimizations, challenges, and research opportunities

T Zhao, Y Xie, Y Wang, J Cheng, X Guo… - Proceedings of the …, 2022 - ieeexplore.ieee.org
Deep learning (DL) has demonstrated great performance in various applications on
powerful computers and servers. Recently, with the advancement of more powerful mobile …

Benchmarking deep learning frameworks: Design considerations, metrics and beyond

L Liu, Y Wu, W Wei, W Cao, S Sahin… - 2018 IEEE 38th …, 2018 - ieeexplore.ieee.org
With increasing number of open-source deep learning (DL) software tools made available,
benchmarking DL software frameworks and systems is in high demand. This paper presents …

EFLOPS: Algorithm and system co-design for a high performance distributed training platform

J Dong, Z Cao, T Zhang, J Ye, S Wang… - … Symposium on High …, 2020 - ieeexplore.ieee.org
Deep neural networks (DNNs) have gained tremendous attractions as compelling solutions
for applications such as image classification, object detection, speech recognition, and so …

Distributed training of deep learning models: A taxonomic perspective

M Langer, Z He, W Rahayu… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Distributed deep learning systems (DDLS) train deep neural network models by utilizing the
distributed resources of a cluster. Developers of DDLS are required to make many decisions …

ASTRA-SIM: Enabling SW/HW co-design exploration for distributed DL training platforms

S Rashidi, S Sridharan, S Srinivasan… - … Analysis of Systems …, 2020 - ieeexplore.ieee.org
Modern Deep Learning systems heavily rely on distributed training over high-performance
accelerator (e.g., TPU, GPU)-based hardware platforms. Examples today include Google's …

A systematic literature review on distributed machine learning in edge computing

CP Filho, E Marques Jr, V Chang, L Dos Santos… - Sensors, 2022 - mdpi.com
Distributed edge intelligence is a disruptive research area that enables the execution of
machine learning and deep learning (ML/DL) algorithms close to where data are generated …

Parameter Hub: A rack-scale parameter server for distributed deep neural network training

L Luo, J Nelson, L Ceze, A Phanishayee… - Proceedings of the …, 2018 - dl.acm.org
Distributed deep neural network (DDNN) training constitutes an increasingly important
workload that frequently runs in the cloud. Larger DNN models and faster compute engines …

Project Adam: Building an efficient and scalable deep learning training system

T Chilimbi, Y Suzue, J Apacible… - 11th USENIX symposium …, 2014 - usenix.org
Large deep neural network models have recently demonstrated state-of-the-art accuracy on
hard visual recognition tasks. Unfortunately such models are extremely time consuming to …