Topologies in distributed machine learning: Comprehensive survey, recommendations and future directions

L Liu, P Zhou, G Sun, X Chen, T Wu, H Yu, M Guizani - Neurocomputing, 2023 - Elsevier
With the widespread use of distributed machine learning (DML), many IT companies have
established networks dedicated to DML. Different communication architectures of DML have …
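
As a rough illustration of why the choice of communication architecture matters (not figures taken from the survey itself), the back-of-the-envelope calculation below compares per-iteration traffic for a single parameter server versus a ring all-reduce topology; the model size M and worker count n are assumed example values.

# Back-of-the-envelope traffic per training iteration (illustrative values only).
M = 400e6   # model/gradient size in bytes (assumed)
n = 16      # number of workers (assumed)

ps_per_worker   = 2 * M                 # push gradients up, pull parameters down
ps_at_server    = 2 * n * M             # the single server exchanges data with every worker
ring_per_worker = 2 * (n - 1) / n * M   # reduce-scatter + all-gather, no central hotspot

print(f"PS, per worker:   {ps_per_worker / 1e6:.0f} MB")
print(f"PS, at server:    {ps_at_server / 1e6:.0f} MB")
print(f"Ring, per worker: {ring_per_worker / 1e6:.0f} MB")

The per-worker volumes are similar; the difference is the 2nM concentrated at the parameter server, which is what alternative topologies try to avoid.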

When should the network be the computer?

DRK Ports, J Nelson - Proceedings of the Workshop on Hot Topics in …, 2019 - dl.acm.org
Researchers have repurposed programmable network devices to place small amounts of
application computation in the network, sometimes yielding orders-of-magnitude …

Compressed communication for distributed deep learning: Survey and quantitative evaluation

H Xu, CY Ho, AM Abdelmoniem, A Dutta, EH Bergou… - 2020 - repository.kaust.edu.sa
Powerful computer clusters are nowadays used to train complex deep neural networks
(DNNs) on large datasets. Distributed training workloads increasingly become …
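
As an illustrative aside (not code from the survey), the sketch below shows one of the simplest quantization schemes such surveys cover, 1-bit sign compression with a per-tensor scale, in plain NumPy.

import numpy as np

def compress_sign(grad):
    # Keep only the sign of each gradient entry plus one scale per tensor,
    # shrinking the payload from 32 bits to roughly 1 bit per element.
    scale = np.mean(np.abs(grad))
    return np.signbit(grad), scale

def decompress_sign(signs, scale):
    # Reconstruct an approximate gradient: -scale where the sign bit was set,
    # +scale elsewhere.
    return np.where(signs, -scale, scale)

grad = np.random.randn(1024).astype(np.float32)
signs, scale = compress_sign(grad)
approx = decompress_sign(signs, scale)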

Merlin HugeCTR: GPU-accelerated recommender system training and inference

Z Wang, Y Wei, M Lee, M Langer, F Yu, J Liu… - Proceedings of the 16th …, 2022 - dl.acm.org
In this talk, we introduce Merlin HugeCTR. Merlin HugeCTR is an open source, GPU-
accelerated integration framework for click-through rate estimation. It optimizes both training …

Prague: High-performance heterogeneity-aware asynchronous decentralized training

Q Luo, J He, Y Zhuo, X Qian - Proceedings of the Twenty-Fifth …, 2020 - dl.acm.org
Distributed deep learning training usually adopts All-Reduce as the synchronization
mechanism for data parallel algorithms due to its high performance in homogeneous …
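
For readers unfamiliar with the All-Reduce primitive these papers build on, the following single-process NumPy sketch simulates a ring all-reduce (reduce-scatter followed by all-gather); it is illustrative only and not the Prague implementation.

import numpy as np

def ring_allreduce(grads):
    # grads: list of equal-length 1-D arrays, one per simulated worker.
    n = len(grads)
    chunks = [list(np.array_split(g.astype(np.float64), n)) for g in grads]

    # Phase 1: reduce-scatter. After n-1 ring steps, worker i holds the
    # fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            s = (i - step) % n          # chunk worker i forwards this step
            j = (i + 1) % n             # ring neighbour
            chunks[j][s] = chunks[j][s] + chunks[i][s]

    # Phase 2: all-gather. Each fully reduced chunk circulates around the
    # ring until every worker has every chunk.
    for step in range(n - 1):
        for i in range(n):
            s = (i + 1 - step) % n
            j = (i + 1) % n
            chunks[j][s] = chunks[i][s]

    return [np.concatenate(c) for c in chunks]

# Every worker ends up with the element-wise sum of all gradients.
workers = [np.random.randn(12) for _ in range(4)]
out = ring_allreduce(workers)
assert np.allclose(out[0], sum(workers))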

Geryon: Accelerating distributed CNN training by network-level flow scheduling

S Wang, D Li, J Geng - IEEE INFOCOM 2020-IEEE Conference …, 2020 - ieeexplore.ieee.org
Increasingly rich data sets and complicated models make distributed machine learning ever
more important. However, the cost of extensive and frequent parameter …
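
As a loose illustration of network-level flow scheduling in general (not Geryon's actual mechanism), the sketch below orders parameter-tensor transfers on a single bottleneck link so that tensors needed earliest in the next forward pass arrive first; the tensor names, sizes, and the 10 Gbps link are made-up values.

import heapq

def schedule_flows(flows, bandwidth):
    # flows: list of (priority, name, size_bytes); a lower priority value
    # means the tensor is needed earlier in the next forward pass.
    # Returns (name, completion_time) pairs on a single bottleneck link.
    heap = list(flows)
    heapq.heapify(heap)
    t, done = 0.0, []
    while heap:
        prio, name, size = heapq.heappop(heap)
        t += size / bandwidth
        done.append((name, t))
    return done

# Parameter tensors of a toy 4-layer model; layer 0 is needed first in the
# next forward pass, so it gets the highest priority (lowest value).
flows = [(3, "layer3.weight", 40e6), (2, "layer2.weight", 30e6),
         (1, "layer1.weight", 20e6), (0, "layer0.weight", 10e6)]
for name, t in schedule_flows(flows, bandwidth=10e9 / 8):  # 10 Gbps link
    print(f"{name} arrives at {t * 1e3:.2f} ms")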

An in-network architecture for accelerating shared-memory multiprocessor collectives

B Klenk, N Jiang, G Thorson… - 2020 ACM/IEEE 47th …, 2020 - ieeexplore.ieee.org
The slowdown of single-chip performance scaling, combined with the growing demand to
compute ever larger problems efficiently, has led to a renewed interest in distributed …

GRID: Gradient routing with in-network aggregation for distributed training

J Fang, G Zhao, H Xu, C Wu… - IEEE/ACM Transactions on …, 2023 - ieeexplore.ieee.org
As the scale of distributed training increases, so does the communication overhead in
clusters. Some works try to reduce the communication cost through gradient compression or …
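
As a toy model of in-network gradient aggregation in general (not GRID's protocol), the sketch below has a simulated switch accumulate each gradient chunk from all workers and forward only the aggregate upstream.

import numpy as np

class ToySwitch:
    # Toy model of in-network aggregation: the switch keeps one accumulator
    # per gradient chunk and forwards a chunk upstream only once all workers
    # have contributed, so upstream traffic shrinks from
    # n_workers * n_chunks packets to n_chunks packets.
    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.partial = {}   # chunk_id -> (accumulated sum, contributor count)
        self.upstream = []  # aggregated chunks forwarded toward the server

    def receive(self, chunk_id, payload):
        acc, cnt = self.partial.get(chunk_id, (np.zeros_like(payload), 0))
        acc, cnt = acc + payload, cnt + 1
        if cnt == self.n_workers:
            self.upstream.append((chunk_id, acc))   # forward the aggregate
            self.partial.pop(chunk_id, None)
        else:
            self.partial[chunk_id] = (acc, cnt)

n_workers, n_chunks = 4, 8
grads = np.random.randn(n_workers, n_chunks, 16)
switch = ToySwitch(n_workers)
for w in range(n_workers):
    for c in range(n_chunks):
        switch.receive(c, grads[w, c])

assert len(switch.upstream) == n_chunks
cid, agg = switch.upstream[0]
assert np.allclose(agg, grads[:, cid].sum(axis=0))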

Optimizing network performance for distributed DNN training on GPU clusters: ImageNet/AlexNet training in 1.5 minutes

P Sun, W Feng, R Han, S Yan, Y Wen - arXiv preprint arXiv:1902.06855, 2019 - arxiv.org
It is important to scale out deep neural network (DNN) training to reduce model training
time. The high communication overhead is one of the major performance bottlenecks for …

Gradient compression supercharged high-performance data parallel DNN training

Y Bai, C Li, Q Zhou, J Yi, P Gong, F Yan… - Proceedings of the …, 2021 - dl.acm.org
Gradient compression is a promising approach to alleviating the communication bottleneck
in data parallel deep neural network (DNN) training by significantly reducing the data …
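
As a generic illustration of the kind of compressor such systems accelerate (not this paper's implementation), here is a top-k sparsification step with error feedback in NumPy; the choice of k = 32 is arbitrary.

import numpy as np

def topk_with_error_feedback(grad, residual, k):
    # Top-k sparsification with error feedback: add the residual carried over
    # from previous rounds, transmit only the k largest-magnitude entries,
    # and keep the untransmitted remainder as the new residual.
    corrected = grad + residual
    idx = np.argpartition(np.abs(corrected), -k)[-k:]
    values = corrected[idx]
    new_residual = corrected.copy()
    new_residual[idx] = 0.0
    return idx, values, new_residual

residual = np.zeros(1024)
for step in range(3):
    grad = np.random.randn(1024)
    idx, values, residual = topk_with_error_feedback(grad, residual, k=32)
    # (idx, values) is what would actually be sent over the network:
    # 32 of 1024 entries, i.e. roughly 3% of the original payload.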