A Survey on Scheduling Techniques in Computing and Network Convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand of massive applications has led to the ubiquitous deployment of
computing power. This trend creates an urgent need for higher-level computing resource …

DeepBoot: Dynamic Scheduling System for Training and Inference Deep Learning Tasks in GPU Cluster

Z Chen, X Zhao, C Zhi, J Yin - IEEE Transactions on Parallel …, 2023 - ieeexplore.ieee.org
Deep learning tasks (DLT) include training and inference tasks, where training DLTs aim to
minimize average job completion time (JCT) while inference tasks need …
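
The training/inference split hinges on the average-JCT objective. A minimal sketch below (a generic illustration, not DeepBoot's scheduler; the job durations are hypothetical) shows how ordering alone changes average JCT on a single GPU:

    # Generic illustration (not DeepBoot's scheduler): average job completion
    # time (JCT) on one GPU, comparing arrival order vs. shortest-job-first.
    def average_jct(durations):
        # Jobs run back-to-back; with all jobs arriving at time 0, a job's JCT
        # is the cumulative runtime up to and including it.
        t, total = 0.0, 0.0
        for d in durations:
            t += d
            total += t
        return total / len(durations)

    jobs = [8.0, 1.0, 4.0]                  # hypothetical durations in hours
    print(average_jct(jobs))                # arrival order (FCFS): 10.0
    print(average_jct(sorted(jobs)))        # shortest-job-first: ~6.33

When all jobs are present at time zero, shortest-job-first minimizes average JCT, which is why many DL schedulers invest in estimating remaining job duration.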

Enabling switch memory management for distributed training with in-network aggregation

B Zhao, C Liu, J Dong, Z Cao, W Nie… - IEEE INFOCOM 2023 …, 2023 - ieeexplore.ieee.org
Distributed training (DT) in shared clusters usually relies on a scheduler to allocate
resources to multiple concurrent jobs. Meanwhile, a recent acceleration primitive, In-Network …
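
In-network aggregation keeps per-job aggregation state in scarce switch SRAM, so that pool must be shared across concurrent jobs. A hedged sketch, assuming a simple proportional-share policy and a hypothetical per-job gradient-traffic metric (the paper's actual mechanism differs):

    # Hedged sketch (not the paper's mechanism): split a fixed pool of switch
    # aggregator slots across concurrent training jobs in proportion to each
    # job's gradient traffic, so no single job monopolizes in-network memory.
    def partition_switch_memory(total_slots, jobs):
        # jobs: name -> gradient bytes per iteration (hypothetical metric)
        total = sum(jobs.values())
        alloc = {name: max(1, int(total_slots * size / total))
                 for name, size in jobs.items()}
        # Trim any overshoot caused by the max(1, ...) floor, largest job first.
        while sum(alloc.values()) > total_slots:
            biggest = max(alloc, key=alloc.get)
            alloc[biggest] -= 1
        return alloc

    print(partition_switch_memory(1024, {"jobA": 4e9, "jobB": 1e9, "jobC": 5e8}))

Proportional share is only one plausible policy; the broader point is that aggregator slots, like GPUs, become a first-class scheduled resource.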

Towards GPU Memory Efficiency for Distributed Training at Scale

R Cheng, C Cai, S Yilmaz, R Mitra, M Bag… - Proceedings of the …, 2023 - dl.acm.org
The scale of deep learning models has grown tremendously in recent years. State-of-the-art
models have reached billions of parameters and terabyte-scale model sizes. Training of …
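
A back-of-envelope calculation makes the memory pressure concrete: with the commonly cited ~16 bytes of weight, gradient, and optimizer state per parameter for mixed-precision Adam training, even a mid-sized model overflows a single GPU. The numbers below are illustrative, not from the paper:

    # Back-of-envelope only (activations, buffers, and fragmentation excluded):
    # mixed-precision Adam training keeps roughly 16 bytes of state per
    # parameter (fp16 weights + fp16 grads + fp32 master weights, momentum,
    # and variance).
    def training_state_gib(n_params, bytes_per_param=16, shards=1):
        return n_params * bytes_per_param / shards / 2**30

    n = 10e9                                         # hypothetical 10B-param model
    print(f"{training_state_gib(n):.0f} GiB unsharded")          # ~149 GiB
    print(f"{training_state_gib(n, shards=8):.0f} GiB per GPU")  # ~19 GiB, 8-way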

Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster

Z Yang, Z Ye, T Fu, J Luo, X Wei, Y Luo… - 2022 IEEE 40th …, 2022 - ieeexplore.ieee.org
With the proliferation of deep learning, there is a strong need to efficiently operate GPU
clusters for deep learning production in giant AI companies, as well as for research and …

A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning

M Xing, H Mao, S Yin, L Pan, Z Zhang, Z Xiao… - Proceedings of the 29th …, 2023 - dl.acm.org
Public cloud GPU clusters are emerging as platforms for training distributed deep
learning jobs. Under this training paradigm, the job scheduler is a crucial component to …
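
To ground the RL framing, the toy loop below shows the generic pattern such schedulers follow: observe the queue and free resources, pick a launchable job, and update a value estimate from a delay-based reward. This is a deliberately simplified single-agent sketch, not the paper's dual-agent design; the state encoding and all names are hypothetical.

    # Deliberately simplified single-agent toy (the paper's design is
    # dual-agent): epsilon-greedy choice of which queued job to launch next,
    # with a value-table update driven by a delay-based reward.
    import random
    from collections import defaultdict

    q_table = defaultdict(float)       # (queue length, job GPUs) -> value estimate

    def pick_job(queue, free_gpus, eps=0.1):
        feasible = [j for j in queue if j["gpus"] <= free_gpus]
        if not feasible:
            return None
        if random.random() < eps:      # explore occasionally
            return random.choice(feasible)
        return max(feasible, key=lambda j: q_table[(len(queue), j["gpus"])])

    def update(state, reward, lr=0.1): # nudge the estimate toward the reward
        q_table[state] += lr * (reward - q_table[state])

    queue = [{"gpus": 4}, {"gpus": 8}, {"gpus": 2}]
    print(pick_job(queue, free_gpus=6))  # one of the jobs needing <= 6 GPUs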

Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey

F Liang, Z Zhang, H Lu, C Li, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With distributed deep learning workloads rapidly increasing in large-scale data centers,
efficient framework-level strategies for resource allocation and workload …

UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands

W Gao, Z Ye, P Sun, T Zhang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The growth of deep learning training (DLT) jobs in modern GPU clusters calls for efficient
deep learning (DL) scheduler designs. Due to the extensive applications of DL technology …

vTrain: A Simulation Framework for Evaluating Cost-effective and Compute-optimal Large Language Model Training

J Bang, Y Choi, M Kim, Y Kim, M Rhu - arXiv preprint arXiv:2312.12391, 2023 - arxiv.org
As large language models (LLMs) become widespread in various application domains, a
critical challenge facing the AI community is how to train these large AI models in a cost …
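
Before reaching for a full simulator, a first-order cost estimate comes from the widely used ~6·N·D FLOPs approximation for transformer training. vTrain itself does much finer-grained, profiling-driven simulation; the sketch below is only that coarse rule of thumb, with hypothetical cluster parameters:

    # Coarse analytical estimate (vTrain does trace-driven simulation; this is
    # only the ~6*N*D FLOPs rule of thumb): wall-clock days to train N
    # parameters on D tokens at a given sustained cluster throughput.
    def training_days(n_params, n_tokens, gpus, flops_per_gpu=312e12, mfu=0.4):
        total_flops = 6 * n_params * n_tokens          # forward + backward
        seconds = total_flops / (gpus * flops_per_gpu * mfu)
        return seconds / 86400

    # Hypothetical run: 7B params, 1T tokens, 256 A100s (312 TFLOPS bf16), 40% MFU.
    print(f"{training_days(7e9, 1e12, 256):.1f} days")   # ~15 days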

On a Meta Learning-based Scheduler for Deep Learning Clusters

J Yang, L Bao, W Liu, R Yang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Deep learning (DL) has become a dominant type of workload on AI computing platforms.
The performance of such platforms depends heavily on how distributed DL jobs are …