Optimus: an efficient dynamic resource scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu, C Guo - Proceedings of the Thirteenth …, 2018 - dl.acm.org
Deep learning workloads are common in today's production clusters due to the proliferation
of deep learning driven AI services (e.g., speech recognition, machine translation). A deep …

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Tiresias: A GPU cluster manager for distributed deep learning

J Gu, M Chowdhury, KG Shin, Y Zhu, M Jeon… - … USENIX Symposium on …, 2019 - usenix.org
Deep learning (DL) training jobs bring some unique challenges to existing cluster
managers, such as unpredictable training times, an all-or-nothing execution model, and …

Deep learning-based job placement in distributed machine learning clusters

Y Bao, Y Peng, C Wu - IEEE INFOCOM 2019-IEEE conference …, 2019 - ieeexplore.ieee.org
Production machine learning (ML) clusters commonly host a variety of distributed ML
workloads, e.g., speech recognition, machine translation. While server sharing among jobs …

Online job scheduling in distributed machine learning clusters

Y Bao, Y Peng, C Wu, Z Li - IEEE INFOCOM 2018-IEEE …, 2018 - ieeexplore.ieee.org
Nowadays large-scale distributed machine learning systems have been deployed to support
various analytics and intelligence services in IT firms. To train a large dataset and derive the …

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

DL2: A deep learning-driven scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Efficient resource scheduling is essential for maximal utilization of expensive deep learning
(DL) clusters. Existing cluster schedulers either are agnostic to machine learning (ML) …

Gadget: Online resource optimization for scheduling ring-all-reduce learning jobs

M Yu, Y Tian, B Ji, C Wu, H Rajan… - IEEE INFOCOM 2022 …, 2022 - ieeexplore.ieee.org
Fueled by advances in distributed deep learning (DDL), recent years have witnessed a
rapidly growing demand for resource-intensive distributed/parallel computing to process …

Elastic parameter server load distribution in deep learning clusters

Y Chen, Y Peng, Y Bao, C Wu, Y Zhu… - Proceedings of the 11th …, 2020 - dl.acm.org
In distributed DNN training, parameter servers (PS) can become performance bottlenecks
due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention …

Job scheduling for large-scale machine learning clusters

H Wang, Z Liu, H Shen - … of the 16th International Conference on …, 2020 - dl.acm.org
With the rapid proliferation of Machine Learning (ML) and Deep Learning (DL) applications
running on modern platforms, it is crucial to satisfy application performance requirements …