Accelerating Distributed DNN Training via Transport Layer Scheduling

Q Duan, C Peng, Z Wang, Y Xu, S Liu… - … on Parallel and …, 2023 - ieeexplore.ieee.org
Communication scheduling is crucial to accelerate the training of large deep learning
models, in which the transmission order of layer-wise deep neural network (DNN) tensors is …
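The general technique behind such scheduling is to reorder gradient transmissions by layer priority: gradients of layers near the input are produced last in the backward pass but are needed first in the next forward pass, so sending them first hides more communication. Below is a minimal illustrative sketch of that priority idea; the class names, chunk size, and `send` callback are assumptions for illustration, not the mechanism proposed in this paper.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class TensorChunk:
    priority: int                      # lower value = layer closer to the input = send first
    payload: bytes = field(compare=False)

class PriorityScheduler:
    """Illustrative priority-based transmission scheduler (not the paper's algorithm)."""

    def __init__(self, chunk_bytes=512 * 1024):
        self.queue = []                # min-heap keyed on layer priority
        self.chunk_bytes = chunk_bytes

    def enqueue(self, layer_index, grad_bytes):
        # Split each gradient into chunks so a high-priority tensor can get ahead
        # of an already queued low-priority one at chunk granularity.
        for i in range(0, len(grad_bytes), self.chunk_bytes):
            heapq.heappush(self.queue,
                           TensorChunk(layer_index, grad_bytes[i:i + self.chunk_bytes]))

    def drain(self, send):
        # Transmit all queued chunks strictly in priority order.
        while self.queue:
            send(heapq.heappop(self.queue).payload)
```

In a real system, enqueue and drain run concurrently with backward computation; the sketch only fixes the ordering policy.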

Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models

W Wang, Z Lai, S Li, W Liu, K Ge, Y Liu… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Mixture of Experts (MoE) has received increasing attention for scaling DNN models to extremely large sizes with a negligible increase in computation. The MoE model has achieved the …
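For context, MoE keeps per-token computation roughly constant by routing each token to only a few experts, which is also why uneven routing becomes a load-balancing problem. The sketch below is a generic top-k dispatch with a per-expert capacity cap, included only to make the imbalance concrete; it is not Prophet's fine-grained balancing scheme, and the shapes and capacity rule are assumptions.

```python
import numpy as np

def topk_dispatch(gate_logits, k=2, capacity_factor=1.25):
    """Generic top-k MoE routing with an expert capacity cap (illustrative only)."""
    num_tokens, num_experts = gate_logits.shape
    capacity = int(capacity_factor * num_tokens * k / num_experts)
    topk = np.argsort(-gate_logits, axis=1)[:, :k]   # k chosen experts per token
    assignments = {e: [] for e in range(num_experts)}
    overflow = []                                    # tokens a full expert had to refuse
    for t in range(num_tokens):
        for e in topk[t]:
            if len(assignments[e]) < capacity:
                assignments[e].append(t)
            else:
                overflow.append((t, int(e)))
    return assignments, overflow

# Skewed gate scores push some experts to their capacity while others sit idle;
# that skew is the imbalance fine-grained load balancing tries to remove.
assignments, overflow = topk_dispatch(np.random.randn(8, 4))
```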

US-Byte: An Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning

Y Gao, B Hu, MB Mashhadi, AL Jin… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The communication bottleneck severely constrains the scalability of distributed deep
learning, and efficient communication scheduling accelerates distributed DNN training by …

An efficient bandwidth-adaptive gradient compression algorithm for distributed training of deep neural networks

Z Wang, Q Duan, Y Xu, L Zhang - Journal of Systems Architecture, 2024 - Elsevier
In distributed deep learning with data parallelism, the communication bottleneck throttles the efficiency of model training. Recent studies adopt versatile gradient compression …
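Gradient compression trades extra computation for fewer bytes on the wire, and a bandwidth-adaptive scheme picks the compression ratio from the currently observed link speed. The rough sketch below combines plain top-k sparsification with a hypothetical linear adaptation rule; both the rule and the parameter names are illustrative assumptions, not the algorithm of this paper.

```python
import numpy as np

def topk_sparsify(grad, ratio):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
    flat = grad.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]                            # (indices, values) to transmit

def adaptive_ratio(bandwidth_mbps, low=0.001, high=0.1, min_bw=1.0, max_bw=100.0):
    """Hypothetical rule: compress harder (smaller ratio) when the link is slower."""
    bw = float(np.clip(bandwidth_mbps, min_bw, max_bw))
    return low + (bw - min_bw) / (max_bw - min_bw) * (high - low)

grad = np.random.randn(1_000_000)
ratio = adaptive_ratio(bandwidth_mbps=25.0)          # e.g. estimated from recent transfers
idx, vals = topk_sparsify(grad, ratio)
```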

OF-WFBP: A near-optimal communication mechanism for tensor fusion in distributed deep learning

Y Gao, Z Zhang, B Hu, AL Jin, C Wu - Parallel Computing, 2023 - Elsevier
The communication bottleneck has severely restricted the scalability of distributed deep
learning. Tensor fusion improves the scalability of data parallelism by overlapping …
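Tensor fusion amortizes per-message overhead by merging many small gradients into one collective call while still overlapping communication with the backward pass (the WFBP pattern). A bare-bones bucket sketch of that general idea follows; the fixed threshold and the `allreduce` callback are assumptions, not OF-WFBP's near-optimal fusion plan.

```python
import numpy as np

class FusionBucket:
    """Fuse small gradients and flush one collective per bucket (WFBP-style sketch)."""

    def __init__(self, allreduce, threshold_bytes=25 * 1024 * 1024):
        self.allreduce = allreduce       # assumed callback performing the collective
        self.threshold = threshold_bytes
        self.pending, self.pending_bytes = [], 0

    def add(self, grad):
        # Called as backward produces each layer's gradient, last layer first.
        self.pending.append(grad)
        self.pending_bytes += grad.nbytes
        if self.pending_bytes >= self.threshold:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        fused = np.concatenate([g.ravel() for g in self.pending])
        self.allreduce(fused)            # one fused call instead of many small ones
        self.pending, self.pending_bytes = [], 0
```

The threshold is the crux: too small and per-call overhead dominates, too large and the last bucket has no remaining computation to hide behind, which is the trade-off such fusion mechanisms optimize.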

Host-driven In-Network Aggregation on RDMA

Y Li, W Li, Y Yao, Y Du, K Li - IEEE INFOCOM 2024-IEEE …, 2024 - ieeexplore.ieee.org
Large-scale datacenter networks are increasingly using in-network aggregation (INA) and
remote direct memory access (RDMA) techniques to accelerate deep neural network (DNN) …
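In-network aggregation moves the summation of gradients into the network path so that each worker sends its data once instead of exchanging it with every peer. The toy model below only captures the functional behavior, a limited pool of aggregation slots that sums the same chunk from all workers; it says nothing about RDMA transport or the host-driven design in this paper, and all names and the slot limit are illustrative.

```python
import numpy as np

class ToyAggregator:
    """Functional toy of in-network aggregation: a small slot pool sums the
    same-indexed chunk from every worker, then releases the aggregated result."""

    def __init__(self, num_workers, num_slots=8):
        self.num_workers = num_workers
        self.num_slots = num_slots       # stands in for scarce on-path memory
        self.slots = {}                  # chunk_id -> (partial_sum, contributions)

    def receive(self, chunk_id, chunk):
        if chunk_id not in self.slots and len(self.slots) >= self.num_slots:
            return None                  # pool full: the sender must retry later
        acc, count = self.slots.get(chunk_id, (np.zeros_like(chunk), 0))
        acc, count = acc + chunk, count + 1
        if count == self.num_workers:    # every worker contributed this chunk
            del self.slots[chunk_id]
            return acc                   # would be multicast back to all workers
        self.slots[chunk_id] = (acc, count)
        return None
```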

AOCC-FL: Federated Learning with Aligned Overlapping via Calibrated Compensation

H Wang, W Xu, Y Fan, R Li… - IEEE INFOCOM 2023-IEEE …, 2023 - ieeexplore.ieee.org
Federated Learning enables collaborative model training among a number of distributed devices under the coordination of a centralized server, where each device alternately …
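For context, the coordination pattern described here is the standard federated averaging loop: the server broadcasts a global model, each device trains on its local data, and the server averages the returned models. The sketch below is plain FedAvg with a stand-in local trainer, not AOCC-FL's aligned overlapping or calibrated compensation.

```python
import numpy as np

def local_train(model, lr=0.01, steps=10):
    """Stand-in for on-device training; a real client would use its private data."""
    for _ in range(steps):
        model = model - lr * np.random.randn(*model.shape) * 0.01
    return model

def federated_round(global_model, num_devices):
    # Server broadcasts the global model; each device trains a local copy.
    local_models = [local_train(global_model.copy()) for _ in range(num_devices)]
    # Server aggregates by (unweighted) averaging the returned local models.
    return np.mean(local_models, axis=0)

global_model = np.zeros(128)
for _ in range(5):                        # five communication rounds
    global_model = federated_round(global_model, num_devices=8)
```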

Libra: Contention-Aware GPU Thread Allocation for Data Parallel Training in High Speed Networks

Y Liu, B Jiang, S Zhao, T Lin, X Wang… - IEEE INFOCOM 2023 …, 2023 - ieeexplore.ieee.org
Overlapping gradient communication with backward computation is a popular technique for reducing communication cost in widely adopted data-parallel S-SGD training. However …
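The overlap itself is commonly realized by handing each freshly computed gradient to a communication worker while the backward pass continues with the next layer. The toy simulation below shows that pipeline with a thread pool standing in for communication streams; the sleep times and pool size are arbitrary assumptions, and the pool size is exactly the kind of resource-allocation knob that contention-aware schemes such as Libra tune.

```python
import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def backward_layer(i):
    time.sleep(0.01)                      # stand-in for computing layer i's gradient
    return np.random.randn(1000)

def communicate(grad):
    time.sleep(0.02)                      # stand-in for all-reducing one gradient
    return grad

comm_pool = ThreadPoolExecutor(max_workers=2)         # how many comm workers to allocate?
handles = []
for layer in reversed(range(10)):                     # backward visits the last layer first
    grad = backward_layer(layer)
    handles.append(comm_pool.submit(communicate, grad))  # overlaps with the next layer
reduced = [h.result() for h in handles]               # synchronize before the optimizer step
comm_pool.shutdown()
```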

Tackling the Communication Bottlenecks of Distributed Deep Learning Training Workloads

CY Ho - 2023 - repository.kaust.edu.sa
Deep Neural Networks (DNNs) find widespread applications across various domains,
including computer vision, recommendation systems, and natural language processing …

Accelerating Deep Neural Network Training on Optical Interconnect Systems

F Dai - 2023 - ourarchive.otago.ac.nz
As deep learning (DL) algorithms evolve and data volumes expand, training deep neural
networks (DNNs) has become essential across various domains, delivering unprecedented …