Strategies and principles of distributed machine learning on big data

EP Xing, Q Ho, P Xie, D Wei - Engineering, 2016 - Elsevier
The rise of big data has led to new demands for machine learning (ML) systems to learn
complex models, with millions to billions of parameters, that promise adequate capacity to …

Orchestrating the development lifecycle of machine learning-based IoT applications: A taxonomy and survey

B Qian, J Su, Z Wen, DN Jha, Y Li, Y Guan… - ACM Computing …, 2020 - dl.acm.org
Machine Learning (ML) and Internet of Things (IoT) are complementary advances: ML
techniques unlock the potential of IoT with intelligence, and IoT applications increasingly …

GPipe: Efficient training of giant neural networks using pipeline parallelism

Y Huang, Y Cheng, A Bapna, O Firat… - Advances in neural …, 2019 - proceedings.neurips.cc
Scaling up deep neural network capacity has been known as an effective approach to
improving model quality for several different machine learning tasks. In many cases …

PipeDream: Fast and efficient pipeline parallel DNN training

A Harlap, D Narayanan, A Phanishayee… - arXiv preprint arXiv …, 2018 - arxiv.org
PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes
computation by pipelining execution across multiple machines. Its pipeline parallel …

Petuum: A new platform for distributed machine learning on big data

EP Xing, Q Ho, W Dai, JK Kim, J Wei, S Lee… - Proceedings of the 21st …, 2015 - dl.acm.org
How can one build a distributed framework that allows efficient deployment of a wide
spectrum of modern advanced machine learning (ML) programs for industrial-scale …

FDML: A collaborative machine learning framework for distributed features

Y Hu, D Niu, J Yang, S Zhou - Proceedings of the 25th ACM SIGKDD …, 2019 - dl.acm.org
Most current distributed machine learning systems try to scale up model training by using a
data-parallel architecture that divides the computation for different samples among workers …

Pyramid: Enabling hierarchical neural networks with edge computing

Q He, Z Dong, F Chen, S Deng, W Liang… - Proceedings of the ACM …, 2022 - dl.acm.org
Machine learning (ML) is powering a rapidly-increasing number of web applications. As a
crucial part of 5G, edge computing facilitates edge artificial intelligence (AI) by ML model …

HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism

JH Park, G Yun, MY Chang, NT Nguyen, S Lee… - 2020 USENIX Annual …, 2020 - usenix.org
Deep Neural Network (DNN) models have continuously been growing in size in order to
improve the accuracy and quality of the models. Moreover, for training of large DNN models …

SiP-ML: high-bandwidth optical network interconnects for machine learning training

M Khani, M Ghobadi, M Alizadeh, Z Zhu… - Proceedings of the …, 2021 - dl.acm.org
This paper proposes optical network interconnects as a key enabler for building high-
bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML …

LightLDA: Big topic models on modest computer clusters

J Yuan, F Gao, Q Ho, W Dai, J Wei, X Zheng… - Proceedings of the 24th …, 2015 - dl.acm.org
When building large-scale machine learning (ML) programs, such as massive topic models
or deep neural networks with up to trillions of parameters and training examples, one usually …