PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters

J Zhang, G Niu, Q Dai, H Li, Z Wu, F Dong, Z Wu - Neurocomputing, 2023 - Elsevier
Recently, pipeline parallelism for large-scale Deep Neural Network (DNN) training has been
developed, which partitions the DNN model across multiple devices (e.g., GPUs) and …
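The partitioning idea described in this snippet can be illustrated with a minimal PyTorch sketch: a toy model is cut into two stages placed on different GPUs, and activations cross the device boundary between them. The two-layer model and the split point are assumptions for illustration only, not PipePar's actual partitioning algorithm.

```python
import torch
import torch.nn as nn

# Toy model cut into two pipeline stages on different GPUs.
# The layer split chosen here is arbitrary; PipePar itself searches for a
# partition that balances compute and communication across heterogeneous GPUs.
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

x = torch.randn(32, 1024, device="cuda:0")
h = stage0(x)                # computed on GPU 0
y = stage1(h.to("cuda:1"))   # activations shipped to GPU 1, rest runs there
```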

Pipe-torch: Pipeline-based distributed deep learning in a gpu cluster with heterogeneous networking

J Zhan, J Zhang - … Conference on Advanced Cloud and Big …, 2019 - ieeexplore.ieee.org
Because training a deep neural network (DNN) takes enormous amounts of time and
computation, researchers often expedite the training process via distributed parallel training …

Optimizing execution for pipelined-based distributed deep learning in a heterogeneously networked GPU cluster

J Zhang, J Zhan, J Li, J Jin… - … and Computation: Practice …, 2020 - Wiley Online Library
Exorbitant resources (computing and memory) are required to train a deep neural network
(DNN). Researchers often deploy distributed parallel training to …

PipeDream: Fast and efficient pipeline parallel DNN training

A Harlap, D Narayanan, A Phanishayee… - arXiv preprint arXiv …, 2018 - arxiv.org
PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes
computation by pipelining execution across multiple machines. Its pipeline parallel …
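The pipelining this snippet refers to can be sketched as a schedule over micro-batches: at each time step, each stage works on a different micro-batch, so the machines stay busy concurrently. The helper below is a hypothetical forward-only fill/drain schedule for illustration; PipeDream's actual 1F1B schedule also interleaves backward passes.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Forward-only fill/drain schedule: at step t, stage s processes
    micro-batch t - s when that index is valid, and idles otherwise."""
    steps = []
    for t in range(num_stages + num_microbatches - 1):
        steps.append([t - s if 0 <= t - s < num_microbatches else None
                      for s in range(num_stages)])
    return steps

# 3 stages, 4 micro-batches: the pipeline fills, runs fully busy, then drains.
for t, work in enumerate(pipeline_schedule(3, 4)):
    print(f"step {t}: {work}")
```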

Hippie: A data-paralleled pipeline approach to improve memory-efficiency and scalability for large DNN training

X Ye, Z Lai, S Li, L Cai, D Sun, L Qiao, D Li - Proceedings of the 50th …, 2021 - dl.acm.org
With the increase of both data and parameter volume, it has become a big challenge to
efficiently train large-scale DNN models on distributed platforms. Ordinary parallelism …

HPH: Hybrid parallelism on heterogeneous clusters for accelerating large-scale DNNs training

Y Duan, Z Lai, S Li, W Liu, K Ge… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
As deep learning models grow larger, training a model with a single computational
resource becomes impractical. To solve this, hybrid parallelism, which combines data and …
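Hybrid parallelism in the sense of this snippet combines pipeline (model) parallelism within a replica with data parallelism across replicas. The sketch below assumes a 4-GPU layout with two 2-stage replicas and averages gradients with a plain Python loop; a real system would use collective communication (e.g., all-reduce), and the placement shown is illustrative rather than HPH's own mapping.

```python
import torch
import torch.nn as nn

# Assumed 4-GPU layout: each replica is a 2-stage pipeline (cuda:0/1 and
# cuda:2/3), and the two replicas are data-parallel over shards of the batch.
def make_replica(dev0, dev1):
    return nn.Linear(256, 1024).to(dev0), nn.Linear(1024, 10).to(dev1)

replicas = [make_replica("cuda:0", "cuda:1"), make_replica("cuda:2", "cuda:3")]

batch = torch.randn(32, 256)
shards = batch.chunk(len(replicas))      # data parallelism: one shard per replica

for (s0, s1), shard in zip(replicas, shards):
    h = s0(shard.to(s0.weight.device))   # pipeline stage 0 of this replica
    y = s1(h.to(s1.weight.device))       # pipeline stage 1 of this replica
    y.sum().backward()                   # toy loss, just to produce gradients

# Data-parallel step: average corresponding gradients across the two replicas.
# A real system would use a collective such as torch.distributed.all_reduce.
params0 = [p for m in replicas[0] for p in m.parameters()]
params1 = [p for m in replicas[1] for p in m.parameters()]
for p0, p1 in zip(params0, params1):
    avg = (p0.grad + p1.grad.to(p0.grad.device)) / 2
    p0.grad.copy_(avg)
    p1.grad.copy_(avg.to(p1.grad.device))
```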

BaPipe: Exploration of balanced pipeline parallelism for DNN training

L Zhao, R Xu, T Wang, T Tian, X Wang, W Wu… - arXiv preprint arXiv …, 2020 - arxiv.org
The size of deep neural networks (DNNs) grows rapidly as the complexity of machine
learning algorithms increases. To satisfy the computation and memory requirements of DNN …
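"Balanced pipeline parallelism" in the title refers to cutting the model into stages whose computation (and memory) loads are as even as possible. The brute-force helper below is a toy illustration over assumed per-layer costs, not BaPipe's actual exploration strategy.

```python
import itertools

def balanced_partition(layer_costs, num_stages):
    """Brute-force split of per-layer costs into contiguous stages so that
    the most expensive stage is as cheap as possible (toy illustration)."""
    n = len(layer_costs)
    best_cost, best_bounds = float("inf"), None
    for cuts in itertools.combinations(range(1, n), num_stages - 1):
        bounds = (0, *cuts, n)
        cost = max(sum(layer_costs[a:b]) for a, b in zip(bounds, bounds[1:]))
        if cost < best_cost:
            best_cost, best_bounds = cost, bounds
    return best_bounds, best_cost

# Six layers with assumed costs, split into three stages:
print(balanced_partition([4, 1, 6, 2, 3, 5], num_stages=3))  # ((0, 2, 4, 6), 8)
```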

HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism

JH Park, G Yun, MY Chang, NT Nguyen, S Lee… - 2020 USENIX Annual …, 2020 - usenix.org
Deep Neural Network (DNN) models have been growing continuously in size in order to
improve their accuracy and quality. Moreover, for training of large DNN models …

Efficient and robust parallel DNN training through model parallelism on multi-GPU platform

CC Chen, CL Yang, HY Cheng - arXiv preprint arXiv:1809.02839, 2018 - arxiv.org
The training process of Deep Neural Network (DNN) is compute-intensive, often taking days
to weeks to train a DNN model. Therefore, parallel execution of DNN training on GPUs is a …

Parallelizing DNN training on GPUs: Challenges and opportunities

W Xu, Y Zhang, X Tang - … Proceedings of the Web Conference 2021, 2021 - dl.acm.org
In recent years, Deep Neural Networks (DNNs) have emerged as a widely adopted
approach in many application domains. Training DNN models is also becoming a significant …