Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

A survey of techniques for optimizing deep learning on GPUs

S Mittal, S Vaishay - Journal of Systems Architecture, 2019 - Elsevier
The rise of deep learning (DL) has been fuelled by improvements in accelerators. Due to
its unique features, the GPU remains the most widely used accelerator for DL …

Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters

R Gu, Y Chen, S Liu, H Dai, G Chen… - … on Parallel and …, 2021 - ieeexplore.ieee.org
Deep learning (DL) is becoming increasingly popular in many domains, including computer
vision, speech recognition, self-driving automobiles, etc. GPUs can train DL models efficiently …

A review on community detection in large complex networks from conventional to deep learning methods: A call for the use of parallel meta-heuristic algorithms

MN Al-Andoli, SC Tan, WP Cheah, SY Tan - IEEE Access, 2021 - ieeexplore.ieee.org
Complex networks (CNs) have gained much attention in recent years due to their
importance and popularity. The rapid growth in the size of CNs leads to more difficulties in …

Optimizing distributed training deployment in heterogeneous GPU clusters

X Yi, S Zhang, Z Luo, G Long, L Diao, C Wu… - Proceedings of the 16th …, 2020 - dl.acm.org
This paper proposes HeteroG, an automatic module to accelerate deep neural network
training in heterogeneous GPU clusters. To train a deep learning model with large amounts …

Fast training of deep learning models over multiple GPUs

X Yi, Z Luo, C Meng, M Wang, G Long, C Wu… - Proceedings of the 21st …, 2020 - dl.acm.org
This paper proposes FastT, a transparent module to work with the TensorFlow framework for
automatically identifying a satisfying deployment and execution order of operations in DNN …

Marble: A multi-GPU aware job scheduler for deep learning on HPC systems

J Han, MM Rafique, L Xu, AR Butt… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
Deep learning (DL) has become a key tool for solving complex scientific problems. However,
managing the multi-dimensional large-scale data associated with DL, especially atop extant …

Garfield: System support for Byzantine machine learning (regular paper)

R Guerraoui, A Guirguis, J Plassmann… - 2021 51st Annual …, 2021 - ieeexplore.ieee.org
We present GARFIELD, a library to transparently make machine learning (ML) applications,
initially built with popular (but fragile) frameworks, e.g., TensorFlow and PyTorch, Byzantine …

PSNet: Reconfigurable network topology design for accelerating parameter server architecture based distributed machine learning

L Liu, Q Jin, D Wang, H Yu, G Sun, S Luo - Future Generation Computer …, 2020 - Elsevier
The bottleneck of Distributed Machine Learning (DML) has shifted from computation
to communication. Many works have focused on speeding up the communication phase from …

Online job scheduling for distributed machine learning in optical circuit switch networks

L Liu, H Yu, G Sun, H Zhou, Z Li, S Luo - Knowledge-Based Systems, 2020 - Elsevier
Networking has become a well-known performance bottleneck for distributed machine
learning (DML). Although many works have focused on accelerating the communication …