Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has seen immense success in recent years, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Strategies and principles of distributed machine learning on big data

EP Xing, Q Ho, P Xie, D Wei - Engineering, 2016 - Elsevier
The rise of big data has led to new demands for machine learning (ML) systems to learn
complex models, with millions to billions of parameters, that promise adequate capacity to …

Edge learning for B5G networks with distributed signal processing: Semantic communication, edge computing, and wireless sensing

W Xu, Z Yang, DWK Ng, M Levorato… - IEEE Journal of …, 2023 - ieeexplore.ieee.org
To process and transfer large amounts of data in emerging wireless services, it has become
increasingly appealing to exploit distributed data communication and learning. Specifically …

A survey on federated learning

C Zhang, Y Xie, H Bai, B Yu, W Li, Y Gao - Knowledge-Based Systems, 2021 - Elsevier
Federated learning is a setup in which multiple clients collaborate, under the coordination
of a central aggregator, to solve machine learning problems. This setting also …
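
The client/aggregator structure this snippet describes is easiest to see in federated averaging (FedAvg). The sketch below is a minimal illustration under assumed conditions (a toy linear model, synthetic client data, hypothetical function names), not code from the survey.

```python
# Minimal sketch of federated averaging (FedAvg); model, data, and
# hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Each client holds a private local dataset that never leaves the client.
clients = []
for _ in range(5):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

def local_update(w, X, y, lr=0.05, epochs=5):
    """One client's local gradient steps on its own data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Central aggregator: broadcast the global model, collect local models,
# and average them, weighted by local dataset size.
w_global = np.zeros(2)
for round_ in range(20):
    local_ws = [local_update(w_global, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients])
    w_global = np.average(local_ws, axis=0, weights=sizes)

print("global model after 20 rounds:", w_global)
```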

Privacy preserving machine learning with homomorphic encryption and federated learning

H Fang, Q Qian - Future Internet, 2021 - mdpi.com
Privacy protection has become an important concern alongside the great success of machine
learning. This paper proposes a multi-party privacy-preserving machine learning …
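
To make the combination of homomorphic encryption and federated learning concrete, here is a toy sketch of additively homomorphic aggregation using the python-paillier library (`pip install phe`). The library choice and protocol details are assumptions for illustration; the paper's actual scheme and key management may differ.

```python
# Toy sketch: the server sums encrypted client updates without ever
# seeing any plaintext. Paillier is additively homomorphic, so
# Enc(a) + Enc(b) decrypts to a + b.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

# Each client encrypts its local model update before sending it.
client_updates = [0.12, -0.05, 0.31]
encrypted = [public_key.encrypt(u) for u in client_updates]

# The aggregator operates on ciphertexts only.
encrypted_sum = encrypted[0]
for c in encrypted[1:]:
    encrypted_sum = encrypted_sum + c

# Only the private-key holder (e.g., the clients jointly, or a trusted
# party, depending on the protocol) can decrypt the aggregate.
aggregate = private_key.decrypt(encrypted_sum)
print("average update:", aggregate / len(client_updates))
```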

Scaling distributed machine learning with in-network aggregation

A Sapio, M Canini, CY Ho, J Nelson, P Kalnis… - … USENIX Symposium on …, 2021 - usenix.org
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …
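
The communication primitive the abstract refers to aggregates workers' gradients inside the network rather than at end hosts. The host-side simulation below sketches the core arithmetic, per-chunk fixed-point summation, under assumed parameters (chunk size, scaling factor); the real system runs this on programmable switch hardware.

```python
# Simulation sketch of in-network gradient aggregation: workers quantize
# gradient chunks to fixed-point integers, a "switch" sums chunks
# element-wise, and workers dequantize the averaged result.
import numpy as np

SCALE = 1 << 16      # fixed-point scaling factor (assumed)
CHUNK = 64           # elements per aggregation slot (assumed)

def quantize(grad):
    return np.round(grad * SCALE).astype(np.int64)

def dequantize(q, n_workers):
    return q.astype(np.float64) / (SCALE * n_workers)

rng = np.random.default_rng(1)
n_workers, dim = 4, 256
grads = [rng.normal(size=dim) for _ in range(n_workers)]

# The "switch" keeps one integer accumulator per slot, sums chunks as
# they arrive from each worker, and streams results back chunk by chunk.
result = np.empty(dim)
for start in range(0, dim, CHUNK):
    accumulator = np.zeros(CHUNK, dtype=np.int64)
    for g in grads:
        accumulator += quantize(g[start:start + CHUNK])
    result[start:start + CHUNK] = dequantize(accumulator, n_workers)

print("max aggregation error:",
      np.max(np.abs(result - np.mean(grads, axis=0))))
```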

Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning

A Qiao, SK Choe, SJ Subramanya… - … on Operating Systems …, 2021 - usenix.org
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-
optimizing inter-dependent factors both at the per-job level and at the cluster-wide level …
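
The "inter-dependent factors" here are resource allocation and training parameters such as batch size, which jointly determine goodput, i.e. raw throughput discounted by statistical efficiency. The sketch below illustrates that trade-off with toy stand-in models; both functions are assumptions for illustration, not Pollux's fitted performance models.

```python
# Sketch of goodput-based co-adaptation: for a given GPU allocation,
# pick the batch size maximizing
#   goodput = throughput (examples/s) * statistical efficiency.
def throughput(num_gpus, batch_size):
    # Toy model: per-step time = compute term + communication overhead.
    step_time = batch_size / (5000.0 * num_gpus) + 0.01 * num_gpus
    return batch_size / step_time

def statistical_efficiency(batch_size, noise_scale=512.0):
    # Toy stand-in: progress per example decays once the batch size
    # grows past the gradient noise scale.
    return noise_scale / (noise_scale + batch_size)

def goodput(num_gpus, batch_size):
    return throughput(num_gpus, batch_size) * statistical_efficiency(batch_size)

for num_gpus in (1, 2, 4, 8):
    best = max(range(64, 4097, 64), key=lambda m: goodput(num_gpus, m))
    print(f"{num_gpus} GPUs -> batch size {best}, "
          f"goodput {goodput(num_gpus, best):.0f}")
```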

Gaia: Geo-distributed machine learning approaching LAN speeds

K Hsieh, A Harlap, N Vijaykumar, D Konomis… - … USENIX Symposium on …, 2017 - usenix.org
Machine learning (ML) is widely used to derive useful information from large-scale data
(such as user activities, pictures, and videos) generated at increasingly rapid rates, all over …
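
The snippet cuts off before the mechanism, but a central idea in Gaia is a significance filter: parameter updates sync eagerly within a data center, while only updates that are large relative to the current parameter value cross the scarce wide-area link. The sketch below is a simplified illustration of that filter, with an assumed threshold and toy data.

```python
# Sketch of a significance filter for geo-distributed training: accumulate
# local updates and flush a coordinate over the WAN only once its
# accumulated magnitude is significant relative to the parameter value.
import numpy as np

THRESHOLD = 0.01                  # 1% relative significance (assumed)

params = np.ones(8)               # current parameters
pending = np.zeros_like(params)   # accumulated, not-yet-sent updates

def maybe_sync_over_wan(update):
    """Return only the update coordinates significant enough to send."""
    global pending
    pending += update
    significant = np.abs(pending) > THRESHOLD * np.abs(params)
    sent = np.where(significant, pending, 0.0)
    pending = np.where(significant, 0.0, pending)
    return sent

rng = np.random.default_rng(2)
wan_bytes = 0
for step in range(100):
    sent = maybe_sync_over_wan(rng.normal(scale=0.002, size=params.shape))
    wan_bytes += np.count_nonzero(sent) * 8
print("bytes over WAN:", wan_bytes, "vs. eager sync:", 100 * params.size * 8)
```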

Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters

H Zhang, Z Zheng, S Xu, W Dai, Q Ho, X Liang… - 2017 USENIX Annual …, 2017 - usenix.org
Deep learning models can take weeks to train on a single GPU-equipped machine,
necessitating scaling out DL training to a GPU-cluster. However, current distributed DL …
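
One technique in Poseidon's communication architecture is wait-free backpropagation: each layer's gradients are handed off for transfer as soon as the backward pass produces them, overlapping communication with the remaining computation. The sketch below simulates that overlap with a background thread; layer sizes, timings, and the send stub are illustrative assumptions.

```python
# Sketch of wait-free backpropagation: gradients are queued for sending
# immediately per layer, instead of after the whole backward pass.
import queue
import threading
import time

send_queue = queue.Queue()

def comm_thread():
    # Stand-in for pushing gradients to parameter servers / peers.
    while True:
        layer, grad_size = send_queue.get()
        time.sleep(grad_size * 1e-9)      # simulated transfer time
        print(f"sent gradients for layer {layer}")
        send_queue.task_done()

threading.Thread(target=comm_thread, daemon=True).start()

layers = [("fc2", 4_000_000), ("fc1", 25_000_000), ("conv", 1_000_000)]
for name, size in layers:          # backward pass: last layer first
    time.sleep(0.01)               # simulated gradient computation
    send_queue.put((name, size))   # overlap transfer with remaining layers

send_queue.join()                  # barrier before the next iteration
```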

Cirrus: A serverless framework for end-to-end ML workflows

J Carreira, P Fonseca, A Tumanov, A Zhang… - Proceedings of the ACM …, 2019 - dl.acm.org
Machine learning (ML) workflows are extremely complex. The typical workflow consists of
distinct stages of user interaction, such as preprocessing, training, and tuning, that are …