Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-art
results in various domains, such as image recognition and natural language processing …

Orchestrating the development lifecycle of machine learning-based IoT applications: A taxonomy and survey

B Qian, J Su, Z Wen, DN Jha, Y Li, Y Guan… - ACM Computing …, 2020 - dl.acm.org
Machine Learning (ML) and Internet of Things (IoT) are complementary advances: ML
techniques unlock the potential of IoT with intelligence, and IoT applications increasingly …

Privacy preserving machine learning with homomorphic encryption and federated learning

H Fang, Q Qian - Future Internet, 2021 - mdpi.com
Privacy protection has been an important concern with the great success of machine
learning. This paper proposes a multi-party privacy-preserving machine learning …

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu, C Guo - Proceedings of the Thirteenth …, 2018 - dl.acm.org
Deep learning workloads are common in today's production clusters due to the proliferation
of deep learning driven AI services (e.g., speech recognition, machine translation). A deep …

Gaia: Geo-distributed machine learning approaching LAN speeds

K Hsieh, A Harlap, N Vijaykumar, D Konomis… - … USENIX Symposium on …, 2017 - usenix.org
Machine learning (ML) is widely used to derive useful information from large-scale data
(such as user activities, pictures, and videos) generated at increasingly rapid rates, all over …

PipeDream: Fast and efficient pipeline parallel DNN training

A Harlap, D Narayanan, A Phanishayee… - arXiv preprint arXiv …, 2018 - arxiv.org
PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes
computation by pipelining execution across multiple machines. Its pipeline parallel …
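
As a concrete illustration of the pipelining idea sketched in this entry, the following minimal Python simulation shows how micro-batches flow through model stages placed on different workers, so that the stages overlap work once the pipeline fills. It is a hypothetical sketch (stage and micro-batch counts are invented) and does not reproduce PipeDream's actual 1F1B schedule or weight stashing.

# Toy simulation of pipeline-parallel scheduling (illustrative only, not
# PipeDream's scheduler). Stage and micro-batch counts are made up.

NUM_STAGES = 4        # model split across 4 workers (hypothetical)
NUM_MICROBATCHES = 8  # one minibatch split into 8 micro-batches (hypothetical)

def pipeline_schedule(num_stages, num_microbatches):
    """Return, per time step, which (stage, micro-batch) forward passes run.

    In the simplest pipelined schedule, stage s starts micro-batch m at
    time step s + m, so different stages work on different micro-batches
    at the same time instead of waiting for one another.
    """
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        active = [(s, t - s) for s in range(num_stages)
                  if 0 <= t - s < num_microbatches]
        schedule.append(active)
    return schedule

if __name__ == "__main__":
    for t, active in enumerate(pipeline_schedule(NUM_STAGES, NUM_MICROBATCHES)):
        print(f"t={t:2d}  " + ", ".join(f"stage{s}:mb{m}" for s, m in active))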

HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism

JH Park, G Yun, MY Chang, NT Nguyen, S Lee… - 2020 USENIX Annual …, 2020 - usenix.org
Deep Neural Network (DNN) models have continuously been growing in size in order to
improve the accuracy and quality of the models. Moreover, for training of large DNN models …

PPDsparse: A parallel primal-dual sparse method for extreme classification

IEH Yen, X Huang, W Dai, P Ravikumar… - Proceedings of the 23rd …, 2017 - dl.acm.org
Extreme Classification comprises multi-class or multi-label prediction where there is a large
number of classes, and is increasingly relevant to many real-world applications such as text …

HET: scaling out huge embedding model training via cache-enabled distributed framework

X Miao, H Zhang, Y Shi, X Nie, Z Yang, Y Tao… - arXiv preprint arXiv …, 2021 - arxiv.org
Embedding models have been an effective learning paradigm for high-dimensional data.
However, one open issue of embedding models is that their representations (latent factors) …

Supporting very large models using automatic dataflow graph partitioning

M Wang, C Huang, J Li - … of the Fourteenth EuroSys Conference 2019, 2019 - dl.acm.org
This paper presents Tofu, a system that partitions very large DNN models across multiple
GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow …
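
To make the partitioning idea concrete, here is a minimal NumPy sketch that splits one large matrix-multiply operator column-wise across several hypothetical devices, so that no single device needs the full weight matrix in memory. It only illustrates the general notion of operator partitioning; it is not Tofu's dataflow-graph partitioning algorithm, and the device count and tensor sizes are invented.

# Illustrative sketch only: splitting one large matmul operator across
# several "devices" to shrink the per-device memory footprint. Not Tofu's
# partitioning algorithm; NUM_DEVICES and tensor shapes are hypothetical.
import numpy as np

NUM_DEVICES = 4  # hypothetical GPU count

def column_partitioned_matmul(x, w, num_devices):
    """Compute x @ w with w split column-wise across num_devices devices.

    Each device holds only a shard of shape (in_dim, out_dim / num_devices),
    so the full weight matrix never has to reside on a single device.
    """
    shards = np.array_split(w, num_devices, axis=1)  # one weight shard per device
    partials = [x @ shard for shard in shards]       # each product runs device-locally
    return np.concatenate(partials, axis=1)          # gather the per-device outputs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 1024))
    w = rng.standard_normal((1024, 4096))
    out = column_partitioned_matmul(x, w, NUM_DEVICES)
    assert np.allclose(out, x @ w)  # partitioned result matches the full matmul
    print(out.shape)  # (32, 4096)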