HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis

S Zhang, L Diao, C Wu, Z Cao, S Wang… - Proceedings of the …, 2024 - dl.acm.org
Single-Program-Multiple-Data (SPMD) parallelism has recently been adopted to train large
deep neural networks (DNNs). Few studies have explored its applicability on heterogeneous …

Enabling foundation models: A distributed collaboration framework based on graph federated learning

J Chen, S Guo, Q Qi, J Hao, S Guo… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Foundation models (FMs), known as pre-trained models, have garnered significant interest
in Industrial Internet due to their remarkable performance and robust generalization …

Reliable data transmission for a VANET-IoIT architecture: A DNN approach

J Ghosh, N Kumar, KA Al-Utaibi, SM Sait, C So-In - Internet of Things, 2024 - Elsevier
The challenges and resilience of vehicular ad hoc network (VANET) and deep neural
network (DNN) hybrid architectures in terms of reliability in smart cities have attracted much …

Adaptive partitioning and efficient scheduling for distributed DNN training in heterogeneous IoT environment

B Huang, X Huang, X Liu, C Ding, Y Yin… - Computer …, 2024 - Elsevier
With the increasing proliferation of Internet-of-Things (IoT) devices, it is a growing trend
toward training a deep neural network (DNN) model in pipeline parallelism across resource …

OWL: Worker-Assisted Server Bandwidth Optimization for Efficient Communication Federated Learning

X Han, B Liu, C Hu, D Cheng - Journal of Parallel and Distributed …, 2024 - Elsevier
Edge computing in federated learning based on centralized architecture often faces
communication constraints in large clusters. Although there have been some efforts like …

A Survey on Performance Modeling and Prediction for Distributed DNN Training

Z Guo, Y Tang, J Zhai, T Yuan, J Jin… - … on Parallel and …, 2024 - ieeexplore.ieee.org
The recent breakthroughs in large-scale DNN attract significant attention from both
academia and industry toward distributed DNN training techniques. Due to the time …

MACHINE LEARNING SYSTEMS IN CONSTRAINED ENVIRONMENTS

B Jeon - 2024 - dprg.cs.uiuc.edu
Machine learning (ML) training and inference systems encounter constraints in current
computation environments due to increased ML model sizes, the fast-growing popularity of …