Communication-efficient distributed deep learning: A comprehensive survey

Z Tang, S Shi, W Wang, B Li, X Chu - arXiv preprint arXiv:2003.06307, 2020 - arxiv.org
Distributed deep learning (DL) has become prevalent in recent years to reduce training time
by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and …

Offloading machine learning to programmable data planes: A systematic survey

R Parizotto, BL Coelho, DC Nunes, I Haque… - ACM Computing …, 2023 - dl.acm.org
The demand for machine learning (ML) has increased significantly in recent decades,
enabling several applications, such as speech recognition, computer vision, and …

Scaling distributed machine learning with {In-Network} aggregation

A Sapio, M Canini, CY Ho, J Nelson, P Kalnis… - … USENIX Symposium on …, 2021 - usenix.org
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …
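
The snippet stops before describing the primitive itself; as a rough illustration of what switch-side aggregation accomplishes, the sketch below simulates an aggregation pool on the host: each worker streams fixed-size gradient chunks into per-slot accumulators, and a slot releases its sum once every worker has contributed. The class and constant names (SwitchAggregator, CHUNK) are hypothetical, and this is a plain-Python model of the idea, not SwitchML's P4/RDMA implementation.

```python
import numpy as np

# Illustrative only: a host-side simulation of switch-side aggregation
# (per-slot accumulation of fixed-size gradient chunks from all workers).

CHUNK = 4          # elements per aggregation slot (tiny for readability)
N_WORKERS = 3

class SwitchAggregator:
    """Sums one chunk per slot; releases the result once every worker has contributed."""
    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.slots = {}          # slot_id -> (running sum, contributions seen)

    def ingest(self, slot_id, chunk):
        acc, seen = self.slots.get(slot_id, (np.zeros_like(chunk), 0))
        acc, seen = acc + chunk, seen + 1
        if seen == self.n_workers:      # all contributions in: broadcast and free the slot
            del self.slots[slot_id]
            return acc
        self.slots[slot_id] = (acc, seen)
        return None                     # still waiting for other workers

# Each worker holds a gradient; the "switch" returns the element-wise sum chunk by chunk.
grads = [np.arange(8, dtype=np.float32) * (w + 1) for w in range(N_WORKERS)]
switch = SwitchAggregator(N_WORKERS)
reduced = {}
for w, g in enumerate(grads):
    for slot_id, start in enumerate(range(0, g.size, CHUNK)):
        out = switch.ingest(slot_id, g[start:start + CHUNK])
        if out is not None:
            reduced[slot_id] = out

result = np.concatenate([reduced[s] for s in sorted(reduced)])
assert np.allclose(result, sum(grads))  # same answer as an ordinary allreduce sum
```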

Grace: A compressed communication framework for distributed machine learning

H Xu, CY Ho, AM Abdelmoniem, A Dutta… - 2021 IEEE 41st …, 2021 - ieeexplore.ieee.org
Powerful computer clusters are used nowadays to train complex deep neural networks
(DNNs) on large datasets. Distributed training is increasingly communication-bound …
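
Compression frameworks of this kind typically hide gradient compression behind a compress/decompress pair that plugs in before and after communication. The sketch below shows that pattern with a simple scaled-sign compressor; the class and method names here are hypothetical illustrations of the pattern, not the framework's actual API.

```python
import numpy as np

class SignCompressor:
    """1-bit sign compression with a per-tensor mean-magnitude scale (a common baseline)."""
    def compress(self, grad):
        scale = np.abs(grad).mean()              # one float of side information
        signs = np.signbit(grad)                 # 1 bit per element (bools here)
        return (np.packbits(signs), scale), grad.shape

    def decompress(self, payload, shape):
        packed, scale = payload
        signs = np.unpackbits(packed, count=int(np.prod(shape))).astype(bool)
        values = np.where(signs, -scale, scale).astype(np.float32)
        return values.reshape(shape)

# Usage: each worker compresses its gradient before communication; the receiver
# decompresses before averaging or applying the update.
comp = SignCompressor()
grad = np.random.randn(4, 4).astype(np.float32)
payload, shape = comp.compress(grad)
approx = comp.decompress(payload, shape)
assert approx.shape == grad.shape
```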

Natural compression for distributed deep learning

S Horváth, CY Ho, L Horvath, AN Sahu… - Mathematical and …, 2022 - proceedings.mlr.press
Modern deep learning models are often trained in parallel over a collection of distributed
machines to reduce training time. In such settings, communication of model updates among …
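
The mechanism named in the title rounds each value stochastically to one of the two nearest powers of two, with probabilities chosen so the compressed value matches the original in expectation. Below is a small illustrative reimplementation of that rounding rule, not the authors' code; natural_compress is a name chosen here.

```python
import numpy as np

def natural_compress(x, rng=None):
    """Round each entry to the power of two just below or just above it, unbiasedly."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x, dtype=np.float64)
    sign, mag = np.sign(x), np.abs(x)
    out = np.zeros_like(x)
    nz = mag > 0
    low = 2.0 ** np.floor(np.log2(mag[nz]))   # nearest power of two at or below |x|
    p_up = (mag[nz] - low) / low              # P(round up to 2*low); 0 if |x| is a power of two
    up = rng.random(p_up.shape) < p_up
    out[nz] = sign[nz] * np.where(up, 2.0 * low, low)
    return out

# Unbiasedness check: the average of many compressions approaches the original vector.
x = np.array([0.3, -1.7, 5.0, 0.0])
samples = np.stack([natural_compress(x) for _ in range(20000)])
print(samples.mean(axis=0))   # close to [0.3, -1.7, 5.0, 0.0]
```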

Near-optimal sparse allreduce for distributed deep learning

S Li, T Hoefler - Proceedings of the 27th ACM SIGPLAN Symposium on …, 2022 - dl.acm.org
Communication overhead is one of the major obstacles to training large deep learning models
at scale. Gradient sparsification is a promising technique to reduce the communication …
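
As a reference point for what gradient sparsification means here, the sketch below keeps only the k largest-magnitude entries per worker and combines the sparse (index, value) contributions with a scatter-add. This is the naive baseline that near-optimal sparse allreduce schemes improve on; the function names and structure are assumptions of this sketch, not the paper's algorithm.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries; return their indices and values."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def sparse_allreduce(sparse_parts, length):
    """Sum (indices, values) pairs from all workers into one dense vector."""
    total = np.zeros(length, dtype=np.float32)
    for idx, vals in sparse_parts:
        np.add.at(total, idx, vals)   # scatter-add handles overlapping indices
    return total

# Example: 4 workers each keep their top-2 entries of an 8-element gradient.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(8).astype(np.float32) for _ in range(4)]
parts = [topk_sparsify(g, k=2) for g in grads]
print(sparse_allreduce(parts, length=8))
```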

From luna to solar: the evolutions of the compute-to-storage networks in alibaba cloud

R Miao, L Zhu, S Ma, K Qian, S Zhuang, B Li… - Proceedings of the …, 2022 - dl.acm.org
This paper presents the two generations of storage network stacks that reduced the average
I/O latency of Alibaba Cloud's EBS service by 72% in the last five years: Luna, a user-space …

Advancements in accelerating deep neural network inference on aiot devices: A survey

L Cheng, Y Gu, Q Liu, L Yang, C Liu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The amalgamation of artificial intelligence with Internet of Things (AIoT) devices has seen a
rapid surge in growth, largely due to the effective implementation of deep neural network …

Flare: Flexible in-network allreduce

D De Sensi, S Di Girolamo, S Ashkboos, S Li… - Proceedings of the …, 2021 - dl.acm.org
The allreduce operation is one of the most commonly used communication routines in
distributed applications. To improve its bandwidth and to reduce network traffic, this …
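
For readers unfamiliar with the primitive being accelerated: allreduce leaves every process holding the element-wise reduction of all processes' inputs. The sketch below simulates a standard ring allreduce (reduce-scatter followed by allgather) on a single host; it only illustrates the semantics and involves no switch, so it is not Flare's in-network mechanism.

```python
import numpy as np

def ring_allreduce(chunks_per_rank):
    """chunks_per_rank[r] is rank r's input vector pre-split into P equal chunks."""
    P = len(chunks_per_rank)
    data = [list(c) for c in chunks_per_rank]   # mutable per-rank buffers

    # Reduce-scatter: after P-1 steps, rank r holds the full sum of chunk (r+1) % P.
    for step in range(P - 1):
        for r in range(P):
            i = (r - step) % P
            data[(r + 1) % P][i] = data[(r + 1) % P][i] + data[r][i]

    # Allgather: circulate the completed chunks so every rank ends with every sum.
    for step in range(P - 1):
        for r in range(P):
            i = (r + 1 - step) % P
            data[(r + 1) % P][i] = data[r][i]

    return [np.concatenate(d) for d in data]

P = 4
inputs = [np.full(8, r, dtype=np.float32) for r in range(P)]   # rank r contributes all-r
chunks = [np.array_split(x, P) for x in inputs]
outputs = ring_allreduce(chunks)
assert all(np.allclose(o, sum(inputs)) for o in outputs)       # every rank gets 0+1+2+3
```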

GRID: Gradient routing with in-network aggregation for distributed training

J Fang, G Zhao, H Xu, C Wu… - IEEE/ACM Transactions on …, 2023 - ieeexplore.ieee.org
As the scale of distributed training increases, so does the communication overhead in
clusters. Some works try to reduce the communication cost through gradient compression or …