A Generic, High-Performance, Compression-Aware Framework for Data Parallel DNN Training

H Wu, S Wang, Y Bai, C Li, Q Zhou, J Yi… - … on Parallel and …, 2023 - ieeexplore.ieee.org
Gradient compression is a promising approach to alleviating the communication bottleneck
in data parallel deep neural network (DNN) training by significantly reducing the data …
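
The snippet describes gradient compression in general terms; a minimal top-k sparsification sketch gives the flavor of how the traffic reduction works (illustrative only, not the paper's framework; the 1% ratio and function names are assumptions):

    import torch

    def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
        # Keep only the largest-magnitude `ratio` fraction of gradient entries.
        flat = grad.flatten()
        k = max(1, int(flat.numel() * ratio))
        _, indices = torch.topk(flat.abs(), k)
        return flat[indices], indices, grad.shape      # values, positions, original shape

    def topk_decompress(values, indices, shape):
        # Rebuild a dense gradient that is zero everywhere except the kept entries.
        flat = torch.zeros(shape, dtype=values.dtype).flatten()
        flat[indices] = values
        return flat.reshape(shape)

    grad = torch.randn(1024, 1024)
    vals, idx, shape = topk_compress(grad)             # roughly 1% of the original volume
    restored = topk_decompress(vals, idx, shape)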

AggTree: A Routing Tree With In-Network Aggregation for Distributed Training

J Nie, W Wu - 2023 IEEE International Performance, Computing …, 2023 - ieeexplore.ieee.org
For distributed training (DT) based on the parameter server (PS) architecture, the
network communication overhead of synchronizing parameters with the servers is huge. In the …
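
For context, the baseline PS synchronization pattern that in-network aggregation schemes such as AggTree try to relieve can be sketched as follows (a toy NumPy sketch with an assumed worker count, not the paper's routing-tree design):

    import numpy as np

    class ParameterServer:
        def __init__(self, model_size: int):
            self.params = np.zeros(model_size, dtype=np.float32)

        def push_and_pull(self, worker_grads: list, lr: float = 0.1):
            # Every worker pushes its full gradient each iteration; the PS averages,
            # updates, and every worker pulls the new parameters back.
            self.params -= lr * np.mean(worker_grads, axis=0)
            return self.params

    ps = ParameterServer(model_size=4)
    grads = [np.random.randn(4).astype(np.float32) for _ in range(8)]   # 8 workers
    new_params = ps.push_and_pull(grads)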

BPCM: a flexible high-speed bypass parallel communication mechanism for GPU cluster

M Wu, Q Chen, J Wang - IEEE Access, 2020 - ieeexplore.ieee.org
With the increasing complexity of computational tasks faced by artificial intelligence
technology, the scale of machine learning models continues to expand, and the data volume …

OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs

T Gu, J Fei, M Canini - Proceedings of the 2024 SIGCOMM Workshop on …, 2024 - dl.acm.org
AllReduce is a collective communication pattern commonly used in Distributed Deep
Learning (DDL) and High Performance Computing (HPC). Sparse AllReduce, which …
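
A conceptual sketch of a sparse AllReduce, built here from an allgather of (index, value) pairs with torch.distributed (assumes an already-initialized process group; this is not OmNICCL's SmartNIC/Direct Cache Access path):

    import torch
    import torch.distributed as dist

    def sparse_allreduce(grad: torch.Tensor, k: int) -> torch.Tensor:
        flat = grad.flatten()
        _, idx = torch.topk(flat.abs(), k)
        # Indices are cast to float so values and positions travel in one 2 x k tensor
        # (exact while the gradient has fewer than 2**24 elements).
        payload = torch.stack([idx.float(), flat[idx]])

        gathered = [torch.empty_like(payload) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, payload)                   # exchange sparse payloads only

        dense = torch.zeros_like(flat)
        for chunk in gathered:                               # accumulate every rank's entries
            dense.index_add_(0, chunk[0].long(), chunk[1])
        return dense.div_(dist.get_world_size()).reshape(grad.shape)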

Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks

Y Liu, B Jiang, S Zhao, T Lin, X Wang… - IEEE INFOCOM 2023 …, 2023 - ieeexplore.ieee.org
Overlapping gradient communication with backward computation is a popular technique to
reduce communication cost in the widely adopted data parallel S-SGD training. However …
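
The overlap itself is commonly implemented with per-parameter gradient hooks that launch asynchronous collectives during the backward pass; a minimal sketch (assumes an initialized torch.distributed process group and PyTorch >= 2.1; Libra's contention-aware GPU thread allocation is not shown):

    import torch
    import torch.distributed as dist

    def attach_overlap_hooks(model: torch.nn.Module, handles: list):
        # Launch an async all_reduce for each gradient as soon as it is accumulated,
        # so communication overlaps the rest of the backward computation.
        def hook(param: torch.Tensor):
            handles.append(dist.all_reduce(param.grad, async_op=True))
        for p in model.parameters():
            if p.requires_grad:
                p.register_post_accumulate_grad_hook(hook)   # PyTorch >= 2.1

    # Per iteration: run loss.backward(), wait on all handles, divide each p.grad by the
    # world size, call optimizer.step(), then clear the handle list.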

Heterogeneity-aware asynchronous decentralized training

Q Luo, J He, Y Zhuo, X Qian - arXiv preprint arXiv:1909.08029, 2019 - arxiv.org
Distributed deep learning training usually adopts All-Reduce as the synchronization
mechanism for data parallel algorithms due to its high performance in homogeneous …
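
In contrast to a global All-Reduce, decentralized training lets each worker mix parameters only with its neighbours; a toy NumPy simulation of ring gossip averaging (the topology and step count are illustrative, not the paper's exact protocol):

    import numpy as np

    def gossip_step(workers: list) -> list:
        # Each worker averages its parameters with its two ring neighbours only,
        # avoiding a global synchronization barrier across all workers.
        n = len(workers)
        return [(workers[i - 1] + workers[i] + workers[(i + 1) % n]) / 3.0 for i in range(n)]

    params = [np.random.randn(4) for _ in range(8)]          # 8 workers' local copies
    for _ in range(20):                                      # repeated mixing approaches the mean
        params = gossip_step(params)
    print(np.std(np.stack(params), axis=0))                  # disagreement shrinks every round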

Enhancing the performance assessment of network-based and machine learning for module availability estimation

AL Challoob, AH Hussein - International Journal of System …, 2024 - inderscienceonline.com
Interpreting network telemetry data is difficult: the size and volume of the data that networks
produce keep rising. ML predicts traffic trends to help decision-making. Classification and …

FreezePipe: An efficient dynamic pipeline parallel approach based on freezing mechanism for distributed DNN training

C Weng, Z Shu, Z Xu, J Zhang, J Luo… - … Cooperative Work in …, 2023 - ieeexplore.ieee.org
Deep Neural Network (DNN) training at large scale is extremely time-consuming and
computationally intensive, and is therefore accelerated with distributed training. In recent years …
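
The freezing mechanism named in the title amounts to disabling gradient computation for front layers once they stabilize; a minimal PyTorch sketch (the convergence criterion and FreezePipe's pipeline repartitioning are not reproduced, and freeze_prefix is an illustrative name):

    import torch.nn as nn

    def freeze_prefix(model: nn.Sequential, num_frozen: int):
        # Stop computing gradients (and disable training-mode behaviour) for the
        # first `num_frozen` layers, shrinking backward work and gradient traffic.
        for layer in list(model)[:num_frozen]:
            layer.eval()
            for p in layer.parameters():
                p.requires_grad_(False)

    model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
    freeze_prefix(model, num_frozen=2)    # e.g. freeze the first block once it has stabilized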

An in-network parameter aggregation using DPDK for multi-GPU deep learning

M Furukawa, T Itsubo… - 2020 Eighth International …, 2020 - ieeexplore.ieee.org
In distributed deep neural network training using remote GPU nodes, communication occurs
iteratively between remote nodes for gradient aggregation. This communication latency …
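
Conceptually, the aggregation an in-network node performs is an N-to-1 reduction of same-shaped gradient chunks; a hypothetical sketch of that step only (recv_chunk and send_chunk are placeholder callbacks, and the DPDK packet I/O the paper relies on is not shown):

    import numpy as np

    def aggregate_round(recv_chunk, send_chunk, num_workers: int, chunk_len: int):
        # recv_chunk/send_chunk are hypothetical I/O callbacks, not a real DPDK API.
        acc = np.zeros(chunk_len, dtype=np.float32)
        for w in range(num_workers):
            acc += recv_chunk(w)          # one same-shaped gradient chunk per worker
        send_chunk(acc)                   # N inbound messages collapse into 1 outbound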

Understanding the performance of in-network computing: A case study

F Yang, Z Wang, X Ma, G Yuan… - 2019 IEEE Intl Conf on …, 2019 - ieeexplore.ieee.org
Numerous distributed applications, including machine learning and big data analysis, have
suffered performance degradation from network bottlenecks. To solve this problem …