Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

F Liang, Z Zhang, H Lu, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid growth in the scale of datasets, models, and devices in deep learning,
large-scale distributed deep learning is attracting increasing attention. In contrast to …
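
One representative family of techniques surveyed in this space is gradient compression. As an illustration only, a minimal top-k sparsification sketch in Python (PyTorch) follows; the 1% default ratio and the helper names are assumptions, not anything prescribed by the survey.

    import math
    import torch

    def topk_sparsify(grad: torch.Tensor, ratio: float = 0.01):
        """Keep only the largest-magnitude fraction `ratio` of gradient
        entries, so a worker transmits (indices, values) instead of the
        dense tensor."""
        flat = grad.flatten()
        k = max(1, int(flat.numel() * ratio))
        _, indices = torch.topk(flat.abs(), k)   # k largest magnitudes
        return indices, flat[indices]

    def densify(indices, values, shape):
        """Rebuild a dense gradient from the sparse payload at the receiver."""
        flat = torch.zeros(math.prod(shape), dtype=values.dtype)
        flat[indices] = values
        return flat.reshape(shape)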

Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters

Z Mo, H Xu, C Xu - Proceedings of the 29th ACM International …, 2024 - dl.acm.org
Modern GPU clusters are inherently heterogeneous, in both computation and
communication. This heterogeneity poses a significant challenge for the …
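
To make the challenge concrete, the following sketch shows one naive way an elastic scheduler might grow a job in a heterogeneous cluster: greedily grant whichever free GPU yields the largest marginal throughput gain. The throughput model and speed table are made-up assumptions; this is not Heet's algorithm.

    # Greedy elastic scale-up in a heterogeneous cluster (illustrative only).
    def marginal_gain(throughput_fn, workers, gpu_type):
        """Speedup from adding one GPU of `gpu_type` to the worker set."""
        return throughput_fn(workers + [gpu_type]) - throughput_fn(workers)

    def elastic_scale_up(throughput_fn, free_gpus, workers, budget):
        """Grant up to `budget` GPUs, preferring the best marginal gain."""
        for _ in range(budget):
            candidates = [(marginal_gain(throughput_fn, workers, g), i)
                          for i, g in enumerate(free_gpus)]
            if not candidates:
                break
            gain, best = max(candidates)
            if gain <= 0:          # adding more workers no longer helps
                break
            workers.append(free_gpus.pop(best))
        return workers

    # Assumed model: synchronous training is paced by the slowest GPU and
    # scales sub-linearly with worker count.
    SPEED = {"V100": 100.0, "A100": 250.0}
    def throughput(workers):
        if not workers:
            return 0.0
        return min(SPEED[g] for g in workers) * len(workers) ** 0.9

    print(elastic_scale_up(throughput, ["V100", "A100", "A100"], ["A100"], 2))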

AutoSched: An Adaptive Self-configured Framework for Scheduling Deep Learning Training Workloads

W Gao, X Zhang, S Huang, S Guo, P Sun… - Proceedings of the 38th …, 2024 - dl.acm.org
Modern Deep Learning Training (DLT) schedulers in GPU datacenters are highly
sophisticated, exposing many configurations. These configurations need to be adjusted …
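
The following sketch illustrates the general idea of adaptive self-configuration: periodically replay a recent job trace through a simulator under candidate configurations and adopt the best. The knobs and the simulate_avg_jct callback are hypothetical placeholders, not AutoSched's interface.

    import itertools

    # Hypothetical scheduler knobs; real schedulers expose many more.
    CANDIDATES = {
        "quantum_s": [30, 60, 120],       # time-slice length in seconds
        "packing":   [True, False],       # allow GPU sharing between jobs
    }

    def retune(simulate_avg_jct, recent_trace):
        """Return the config with the best simulated average JCT."""
        best_cfg, best_jct = None, float("inf")
        for values in itertools.product(*CANDIDATES.values()):
            cfg = dict(zip(CANDIDATES.keys(), values))
            jct = simulate_avg_jct(recent_trace, cfg)
            if jct < best_jct:
                best_cfg, best_jct = cfg, jct
        return best_cfg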

Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey

F Liang, Z Zhang, H Lu, C Li, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With rapidly increasing distributed deep learning workloads in large-scale data centers,
efficient strategies for resource allocation and workload …

UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands

W Gao, Z Ye, P Sun, T Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The growth of deep learning training (DLT) jobs in modern GPU clusters calls for efficient
deep learning (DL) scheduler designs. Due to the extensive applications of DL technology …
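
As a generic illustration of serving mixed user demands, the sketch below orders deadline jobs by least slack and best-effort jobs by shortest remaining time. It is an assumed baseline, not UniSched's actual policy.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Job:
        name: str
        remaining_s: float            # estimated remaining run time
        deadline_s: Optional[float]   # absolute deadline; None = best-effort

    def schedule_order(jobs: List[Job], now: float) -> List[Job]:
        """Deadline jobs first, ordered by least slack; best-effort jobs
        after, ordered by shortest remaining time."""
        deadline = sorted((j for j in jobs if j.deadline_s is not None),
                          key=lambda j: (j.deadline_s - now) - j.remaining_s)
        best_effort = sorted((j for j in jobs if j.deadline_s is None),
                             key=lambda j: j.remaining_s)
        return deadline + best_effort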

Non-Clairvoyant Scheduling of Distributed Machine Learning with Inter-job and Intra-job Parallelism on Heterogeneous GPUs

F Chen, P Li, C Wu, S Guo - IEEE Transactions on Cloud …, 2024 - ieeexplore.ieee.org
Distributed machine learning (DML) has shown great promise in accelerating model training
on multiple GPUs. To increase GPU utilization, a common practice is to let multiple learning …
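
In a non-clairvoyant setting, job durations are unknown to the scheduler. The canonical policy for that setting is least-attained-service (LAS): always run the job that has received the least GPU time so far. The sketch below shows that baseline, not the heuristic proposed in this paper.

    import heapq

    class LASScheduler:
        def __init__(self):
            self._heap = []   # (attained_service_s, job_id)

        def submit(self, job_id):
            heapq.heappush(self._heap, (0.0, job_id))

        def run_next(self, quantum_s):
            """Run the least-served job for one quantum; return its id."""
            attained, job_id = heapq.heappop(self._heap)
            heapq.heappush(self._heap, (attained + quantum_s, job_id))
            return job_id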

GPU Cluster Scheduling for Network-Sensitive Deep Learning

A Sharma, VM Bhasi, S Singh, G Kesidis… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables
proximity-based consolidation of GPU resources according to the DDL jobs' sensitivities to the …
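
A minimal sketch of proximity-based placement follows: candidate placements are scored by how few nodes they span, weighted by the job's network sensitivity, so communication-heavy jobs win the most consolidated slots. The scoring formula is an illustrative assumption, not the paper's.

    def placement_score(placement, sensitivity):
        """Higher is better. `placement` is a list of (node, gpu) slots;
        `sensitivity` in [0, 1] measures how much the job slows when spread."""
        nodes_spanned = len({node for node, _ in placement})
        # A fully consolidated placement (1 node) scores 1.0; spreading is
        # penalized in proportion to the job's network sensitivity.
        return 1.0 - sensitivity * (nodes_spanned - 1) / max(1, len(placement) - 1)

    def pick_placement(candidates, sensitivity):
        """Choose the candidate placement with the best locality score."""
        return max(candidates, key=lambda p: placement_score(p, sensitivity))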

A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters

C Xue, W Cui, H Zhao, Q Chen, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Joint consideration of scheduling and adaptive parallelism offers great opportunities for
improving the training efficiency of large models on heterogeneous GPU clusters. However …
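
The sketch below shows the joint decision in its simplest form: exhaustively pick, for each job, a (GPU count, parallelism strategy) pair from a profiled table so that aggregate throughput is maximized under a GPU budget. The brute-force search and the profile numbers are illustrative assumptions, not the paper's method.

    import itertools

    def best_plan(profiles, total_gpus):
        """profiles[job][(gpus, strategy)] = measured throughput (samples/s).
        Returns (best aggregate throughput, {job: (gpus, strategy)})."""
        jobs = list(profiles)
        options = [list(profiles[j].items()) for j in jobs]
        best = (0.0, None)
        for combo in itertools.product(*options):
            gpus_used = sum(g for (g, _), _ in combo)
            if gpus_used > total_gpus:
                continue
            throughput = sum(t for _, t in combo)
            if throughput > best[0]:
                best = (throughput, dict(zip(jobs, [cfg for cfg, _ in combo])))
        return best

    profiles = {
        "jobA": {(2, "data"): 90.0, (4, "pipeline"): 150.0},
        "jobB": {(2, "data"): 80.0, (4, "data"): 130.0},
    }
    print(best_plan(profiles, 6))  # (230.0, {'jobA': (4, 'pipeline'), 'jobB': (2, 'data')})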

Towards providing reliable job completion time predictions using PCS

AB Faisal, N Martin, HM Bashir, S Lamelas… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we build a case for providing job completion time predictions to cloud users,
similar to the delivery date of a package or the arrival time of a booked ride. Our analysis reveals …
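
The intuition can be sketched directly: if the scheduling policy is simple and known (here, FIFO on a fixed GPU pool), the scheduler can simulate its queue and quote each user a finish time. This illustrates the idea only; it is not the PCS system itself.

    import heapq

    def predict_fifo_jct(queue, num_gpus, now=0.0):
        """queue: list of (job_id, gpus_needed, runtime_s) in arrival order.
        Returns {job_id: predicted completion time}. Assumes each job holds
        gpus_needed GPUs for runtime_s seconds, FIFO order, no preemption."""
        free = [now] * num_gpus          # time at which each GPU frees up
        heapq.heapify(free)
        done = {}
        for job_id, gpus_needed, runtime_s in queue:
            # The job starts once its earliest-free GPUs are all available.
            grabbed = [heapq.heappop(free) for _ in range(gpus_needed)]
            start = max(grabbed)
            finish = start + runtime_s
            done[job_id] = finish
            for _ in range(gpus_needed):
                heapq.heappush(free, finish)
        return done

    print(predict_fifo_jct([("a", 2, 100.0), ("b", 1, 50.0), ("c", 2, 30.0)], 2))
    # {'a': 100.0, 'b': 150.0, 'c': 180.0}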