Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-art results in various domains, such as image recognition and natural language processing …

Deep neural networks in the cloud: Review, applications, challenges and research directions

KY Chan, B Abu-Salih, R Qaddoura, AZ Ala'M… - Neurocomputing, 2023 - Elsevier
Deep neural networks (DNNs) are currently being deployed as machine learning technology
in a wide range of important real-world applications. DNNs consist of a huge number of …

Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning

A Qiao, SK Choe, SJ Subramanya… - … on Operating Systems …, 2021 - usenix.org
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-
optimizing inter-dependent factors both at the per-job level and at the cluster-wide level …

Bamboo: Making preemptible instances resilient for affordable training of large DNNs

J Thorpe, P Zhao, J Eyolfson, Y Qiao, Z Jia… - … USENIX Symposium on …, 2023 - usenix.org
DNN models across many domains continue to grow in size, resulting in high resource
requirements for effective training, and unpalatable (and often unaffordable) costs for …

KungFu: Making training in distributed machine learning adaptive

L Mai, G Li, M Wagenländer, K Fertakis… - … USENIX Symposium on …, 2020 - usenix.org
When using distributed machine learning (ML) systems to train models on a cluster of worker
machines, users must configure a large number of parameters: hyper-parameters (e.g., the …

DL2: A deep learning-driven scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Efficient resource scheduling is essential for maximal utilization of expensive deep learning
(DL) clusters. Existing cluster schedulers either are agnostic to machine learning (ML) …

Fleet: Online federated learning via staleness awareness and performance prediction

G Damaskinos, R Guerraoui, AM Kermarrec… - ACM Transactions on …, 2022 - dl.acm.org
Federated learning (FL) is very appealing for its privacy benefits: essentially, a global model
is trained with updates computed on mobile devices while keeping the data of users local …

Distributed deep learning on data systems: A comparative analysis of approaches

Y Zhang, F Mcquillan, N Jayaram, N Kak… - Proceedings of the …, 2021 - par.nsf.gov
Deep learning (DL) is growing in popularity for many data analytics applications, including
among enterprises. Large business-critical datasets in such settings typically reside in …

Crossbow: Scaling deep learning with small batch sizes on multi-GPU servers

A Koliousis, P Watcharapichat, M Weidlich… - arXiv preprint arXiv …, 2019 - arxiv.org
Deep learning models are trained on servers with many GPUs, and training must scale with
the number of GPUs. Systems such as TensorFlow and Caffe2 train models with parallel …

Resource elasticity in distributed deep learning

A Or, H Zhang, M Freedman - Proceedings of Machine …, 2020 - proceedings.mlsys.org
Elasticity—scaling out or in depending upon resource demand or availability—allows a
system to improve its efficiency or performance. This leads to potentially significant cost …