Deep learning workload scheduling in gpu datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) has flourished in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs

Q Hu, M Zhang, P Sun, Y Wen, T Zhang - Proceedings of the 28th ACM …, 2023 - dl.acm.org
While recent deep learning workload schedulers exhibit excellent performance, it is arduous
to deploy them in practice due to substantial defects, including inflexible intrusive …

AutoSched: An Adaptive Self-configured Framework for Scheduling Deep Learning Training Workloads

W Gao, X Zhang, S Huang, S Guo, P Sun… - Proceedings of the 38th …, 2024 - dl.acm.org
Modern Deep Learning Training (DLT) schedulers in GPU datacenters are designed to be
very sophisticated with many configurations. These configurations need to be adjusted …

GPARS: Graph predictive algorithm for efficient resource scheduling in heterogeneous GPU clusters

S Wang, S Chen, Y Shi - Future Generation Computer Systems, 2024 - Elsevier
Efficient resource scheduling in heterogeneous graphics processing unit (GPU) clusters is
critical for maximizing system performance and optimizing resource utilization. However …

Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster

Z Yang, Z Ye, T Fu, J Luo, X Wei, Y Luo… - 2022 IEEE 40th …, 2022 - ieeexplore.ieee.org
With the proliferation of deep learning, there exists a strong need to efficiently operate GPU
clusters for deep learning production in giant AI companies, as well as for research and …

A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning

M Xing, H Mao, S Yin, L Pan, Z Zhang, Z Xiao… - Proceedings of the 29th …, 2023 - dl.acm.org
Public cloud GPU clusters are emerging as platforms for training distributed deep
learning jobs. Under this training paradigm, the job scheduler is a crucial component to …

UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands

W Gao, Z Ye, P Sun, T Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The growth of deep learning training (DLT) jobs in modern GPU clusters calls for efficient
deep learning (DL) scheduler designs. Due to the extensive applications of DL technology …

NGS: A network GPGPU system for orchestrating remote and virtual accelerators

J Prades, C Reaño, F Silla - Journal of Systems Architecture, 2024 - Elsevier
In General-Purpose computing on Graphics Processing Unit (GPGPU), the use of
CPUs is combined with that of GPUs. CPUs are used for sequential code, while GPUs are …

A Stochastic Approach for Scheduling AI Training Jobs in GPU-based Systems

F Filippini, J Anselmi, D Ardagna… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
In this work, we optimize the scheduling of Deep Learning (DL) training jobs from the
perspective of a Cloud Service Provider running a data center, which efficiently selects …