Accelerating Distributed Training With Collaborative In-Network Aggregation

J Fang, H Xu, G Zhao, Z Yu, B Shen… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
The surging scale of distributed training (DT) incurs significant communication overhead in
datacenters; a promising solution is in-network aggregation (INA), which leverages …

Constrained in-network computing with low congestion in datacenter networks

R Segal, C Avin, G Scalosub - IEEE INFOCOM 2022-IEEE …, 2022 - ieeexplore.ieee.org
Distributed computing has become common practice, and recent focus has turned to the use
of smart networking devices with in-network computing capabilities …

Shifted compression framework: Generalizations and improvements

E Shulgin, P Richtárik - Uncertainty in Artificial Intelligence, 2022 - proceedings.mlr.press
Communication is one of the key bottlenecks in the distributed training of large-scale
machine learning models, and lossy compression of exchanged information, such as …

A quantitative study of deep learning training on heterogeneous supercomputers

J Han, L Xu, M Rafique, AR Butt, SH Lim - 2019 - osti.gov
Deep learning (DL) has become a key technique for solving complex problems in scientific
research and discovery. DL training for science is substantially challenging because it has to …

Endpoint-flexible coflow scheduling across geo-distributed datacenters

W Li, X Yuan, K Li, H Qi, X Zhou… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Over the last decade, we have witnessed growing data volumes generated and stored
across geographically distributed datacenters. Processing such geo-distributed datasets …

CEFS: Compute-efficient flow scheduling for iterative synchronous applications

S Wang, D Li, J Zhang, W Lin - … of the 16th International Conference on …, 2020 - dl.acm.org
Iterative Synchronous Applications (ISApps), exemplified by distributed deep learning (DL) training,
are popular in today's data centers. In ISApps, multiple nodes carry out the computing …

XAgg: Accelerating Heterogeneous Distributed Training Through XDP-Based Gradient Aggregation

Q Zhang, G Zhao, H Xu, P Yang - IEEE/ACM Transactions on …, 2023 - ieeexplore.ieee.org
With the growth of model/dataset/system size for distributed model training in datacenters,
the widely used Parameter Server (PS) architecture suffers from a communication bottleneck …

DGT: A contribution-aware differential gradient transmission mechanism for distributed machine learning

H Zhou, Z Li, Q Cai, H Yu, S Luo, L Luo… - Future Generation …, 2021 - Elsevier
Distributed machine learning is a mainstream approach to learning insights for analytics and
intelligence services on many fronts (e.g., health, streaming, and business) from their massive …

Horizontal or vertical? A hybrid approach to large-scale distributed machine learning

J Geng, D Li, S Wang - Proceedings of the 10th Workshop on Scientific …, 2019 - dl.acm.org
Data parallelism and model parallelism are two typical parallel modes for distributed
machine learning (DML). Traditionally, DML mainly leverages data parallelism, which …

PSNet: Reconfigurable network topology design for accelerating parameter server architecture based distributed machine learning

L Liu, Q Jin, D Wang, H Yu, G Sun, S Luo - Future Generation Computer …, 2020 - Elsevier
The bottleneck of Distributed Machine Learning (DML) has shifted from computation
to communication. Many works have focused on speeding up the communication phase from …