Deep learning workload scheduling in gpu datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

A survey on scheduling techniques in computing and network convergence

S Tang, Y Yu, H Wang, G Wang, W Chen… - … Surveys & Tutorials, 2023 - ieeexplore.ieee.org
The computing demand for massive applications has led to the ubiquitous deployment of
computing power. This trend results in the urgent need for higher-level computing resource …

{MLaaS} in the wild: Workload analysis and scheduling in {Large-Scale} heterogeneous {GPU} clusters

Q Weng, W Xiao, Y Yu, W Wang, C Wang, J He… - … USENIX Symposium on …, 2022 - usenix.org
With the sustained technological advances in machine learning (ML) and the availability of
massive datasets recently, tech companies are deploying large ML-as-a-Service (MLaaS) …

Scaling distributed machine learning with {In-Network} aggregation

A Sapio, M Canini, CY Ho, J Nelson, P Kalnis… - … USENIX Symposium on …, 2021 - usenix.org
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …

Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning

A Qiao, SK Choe, SJ Subramanya… - … on Operating Systems …, 2021 - usenix.org
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-
optimizing inter-dependent factors both at the per-job level and at the cluster-wide level …

Characterization and prediction of deep learning workloads in large-scale gpu datacenters

Q Hu, P Sun, S Yan, Y Wen, T Zhang - Proceedings of the International …, 2021 - dl.acm.org
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services
in both the research community and industry. When operating a datacenter, optimization of …

Looking beyond {GPUs} for {DNN} scheduling on {Multi-Tenant} clusters

J Mohan, A Phanishayee, J Kulkarni… - … USENIX Symposium on …, 2022 - usenix.org
Training Deep Neural Networks (DNNs) is a popular workload in both enterprises and cloud
data centers. Existing schedulers for DNN training consider GPU as the dominant resource …

Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling

S Jayaram Subramanya, D Arfeen, S Lin… - Proceedings of the 29th …, 2023 - dl.acm.org
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to
elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …

Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters

R Gu, Y Chen, S Liu, H Dai, G Chen… - … on Parallel and …, 2021 - ieeexplore.ieee.org
Deep learning (DL) is becoming increasingly popular in many domains, including computer
vision, speech recognition, self-driving automobiles, etc. GPU can train DL models efficiently …

Multi-resource interleaving for deep learning training

Y Zhao, Y Liu, Y Peng, Y Zhu, X Liu, X Jin - Proceedings of the ACM …, 2022 - dl.acm.org
Training Deep Learning (DL) model requires multiple resource types, including CPUs,
GPUs, storage IO, and network IO. Advancements in DL have produced a wide spectrum of …