Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling

S Jayaram Subramanya, D Arfeen, S Lin… - Proceedings of the 29th …, 2023 - dl.acm.org
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to
elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …

MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters

B Li, T Patel, S Samsi, V Gadepally… - Proceedings of the 13th …, 2022 - dl.acm.org
GPU technology has been improving at an expedited pace in terms of size and performance,
empowering HPC and AI/ML researchers to advance the scientific discovery process …

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent

Q Weng, L Yang, Y Yu, W Wang, X Tang… - 2023 USENIX Annual …, 2023 - usenix.org
Large tech companies are piling up a massive number of GPUs in their server fleets to run
diverse machine learning (ML) workloads. However, these expensive devices often suffer …

Chronus: A novel deadline-aware scheduler for deep learning training jobs

W Gao, Z Ye, P Sun, Y Wen, T Zhang - … of the ACM Symposium on Cloud …, 2021 - dl.acm.org
Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner.
Job scheduling is the key to improve the training performance, resource utilization and …

AI-enabling workloads on large-scale GPU-accelerated system: Characterization, opportunities, and implications

B Li, R Arora, S Samsi, T Patel… - … Symposium on High …, 2022 - ieeexplore.ieee.org
Production high-performance computing (HPC) systems are adopting and integrating GPUs
into their design to accommodate artificial intelligence (AI), machine learning, and data …

Task placement and resource allocation for edge machine learning: a GNN-based multi-agent reinforcement learning paradigm

Y Li, X Zhang, T Zeng, J Duan, C Wu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Machine learning (ML) tasks are one of the major workloads in today's edge computing
networks. Existing edge-cloud schedulers allocate the requested amounts of resources to …

Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs

Q Hu, M Zhang, P Sun, Y Wen, T Zhang - Proceedings of the 28th ACM …, 2023 - dl.acm.org
While recent deep learning workload schedulers exhibit excellent performance, it is arduous
to deploy them in practice due to some substantial defects, including inflexible intrusive …

Toward Sustainable HPC: Carbon Footprint Estimation and Environmental Implications of HPC Systems

B Li, R Basu Roy, D Wang, S Samsi… - Proceedings of the …, 2023 - dl.acm.org
The rapid growth in demand for HPC systems has led to a rise in carbon footprint, which
requires urgent intervention. In this work, we present a comprehensive analysis of the …