Distributed computing has become common practice, with recent focus on smart networking devices that provide in-network computing capabilities …
E Shulgin, P Richtárik - Uncertainty in Artificial Intelligence, 2022 - proceedings.mlr.press
Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models, and lossy compression of exchanged information, such as …
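One common family of lossy compressors referenced in this line of work is top-k gradient sparsification: each worker transmits only the k largest-magnitude gradient entries. A minimal sketch (the function name and the toy gradient are illustrative, not from the paper):

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient; zero the rest."""
    flat = grad.ravel()
    # Indices of the k entries with largest absolute value.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    compressed = np.zeros_like(flat)
    compressed[idx] = flat[idx]
    return compressed.reshape(grad.shape)

g = np.array([0.1, -2.0, 0.05, 1.5, -0.3])
print(top_k_sparsify(g, 2))  # only -2.0 and 1.5 survive
```

In practice the surviving values are sent with their indices, so the communicated payload shrinks from the full dense vector to roughly k (value, index) pairs.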
Deep learning (DL) has become a key technique for solving complex problems in scientific research and discovery. DL training for science is substantially challenging because it has to …
W Li, X Yuan, K Li, H Qi, X Zhou… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Over the last decade, we have witnessed growing data volumes generated and stored across geographically distributed datacenters. Processing such geo-distributed datasets …
S Wang, D Li, J Zhang, W Lin - … of the 16th International Conference on …, 2020 - dl.acm.org
Iterative Synchronous Applications (ISApps), exemplified by distributed deep learning (DL) training, are popular in today's data centers. In ISApps, multiple nodes carry out the computing …
Q Zhang, G Zhao, H Xu, P Yang - IEEE/ACM Transactions on …, 2023 - ieeexplore.ieee.org
With the growth of model/dataset/system size for distributed model training in datacenters, the widely used Parameter Server (PS) architecture suffers from a communication bottleneck …
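The PS architecture mentioned here centralizes parameter state: workers push local gradients to the server, which averages them, updates the model, and serves the new parameters back via pull. A toy single-process sketch of that push/apply/pull cycle (class and method names are illustrative):

```python
class ParameterServer:
    """Toy parameter server: workers push gradients, server averages and applies them."""

    def __init__(self, dim, lr=0.1):
        self.params = [0.0] * dim
        self.lr = lr
        self.buffer = []  # gradients received since the last update

    def push(self, grad):
        self.buffer.append(grad)

    def apply(self):
        # Average buffered gradients and take one SGD step.
        n = len(self.buffer)
        avg = [sum(g[i] for g in self.buffer) / n for i in range(len(self.params))]
        self.params = [p - self.lr * a for p, a in zip(self.params, avg)]
        self.buffer.clear()

    def pull(self):
        return list(self.params)

ps = ParameterServer(dim=2)
ps.push([1.0, 2.0])  # worker 1's gradient
ps.push([3.0, 2.0])  # worker 2's gradient
ps.apply()
print(ps.pull())  # [-0.2, -0.2]
```

The communication bottleneck the snippet refers to arises because every worker's push and pull traverses the same server links each iteration, so server bandwidth scales poorly with worker count and model size.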
H Zhou, Z Li, Q Cai, H Yu, S Luo, L Luo… - Future Generation …, 2021 - Elsevier
Distributed machine learning is a mainstream approach to learning insights for analytics and intelligence services across many domains (e.g., health, streaming, and business) from their massive …
J Geng, D Li, S Wang - Proceedings of the 10th Workshop on Scientific …, 2019 - dl.acm.org
Data parallelism and model parallelism are two typical parallel modes for distributed machine learning (DML). Traditionally, DML mainly leverages data parallelism, which …
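Under the data parallelism this snippet describes, each worker holds a full model replica and a shard of the data; averaging the per-shard gradients reproduces the full-batch gradient when shards are equal-sized. A small sketch with a least-squares objective (all names and data here are illustrative):

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of mean squared error for a linear model."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w = np.zeros(3)

shards = np.array_split(np.arange(8), 2)            # two workers, four samples each
local = [grad_mse(w, X[s], y[s]) for s in shards]   # computed independently per worker
avg = np.mean(local, axis=0)                        # all-reduce step, simulated

# With equal shard sizes, the averaged gradient equals the full-batch gradient.
assert np.allclose(avg, grad_mse(w, X, y))
```

Model parallelism instead partitions the parameters themselves across workers, which trades the gradient-synchronization traffic of data parallelism for activation traffic between model partitions.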
L Liu, Q Jin, D Wang, H Yu, G Sun, S Luo - Future Generation Computer …, 2020 - Elsevier
The bottleneck of Distributed Machine Learning (DML) has shifted from computation to communication. Many works have focused on speeding up the communication phase from …