Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent

Q Weng, L Yang, Y Yu, W Wang, X Tang… - 2023 USENIX Annual …, 2023 - usenix.org
Large tech companies are piling up a massive number of GPUs in their server fleets to run
diverse machine learning (ML) workloads. However, these expensive devices often suffer …

Salus: Fine-grained GPU sharing primitives for deep learning applications

P Yu, M Chowdhury - arXiv preprint arXiv:1902.04610, 2019 - arxiv.org
GPU computing is becoming increasingly popular with the proliferation of deep
learning (DL) applications. However, unlike traditional resources such as CPU or the …

Themis: Fair and efficient GPU cluster scheduling

K Mahajan, A Balasubramanian, A Singhvi… - … USENIX Symposium on …, 2020 - usenix.org
Modern distributed machine learning (ML) training workloads benefit significantly from
leveraging GPUs. However, significant contention ensues when multiple such workloads are …

MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters

Q Weng, W Xiao, Y Yu, W Wang, C Wang, J He… - … USENIX Symposium on …, 2022 - usenix.org
With the sustained technological advances in machine learning (ML) and the availability of
massive datasets recently, tech companies are deploying large ML-as-a-Service (MLaaS) …

SchedTune: A heterogeneity-aware GPU scheduler for deep learning

H Albahar, S Dongare, Y Du, N Zhao… - 2022 22nd IEEE …, 2022 - ieeexplore.ieee.org
Modern cluster management systems, such as Kubernetes, support heterogeneous
workloads and resources. However, existing resource schedulers in these systems do not …

Analysis of large-scale multi-tenant GPU clusters for DNN training workloads

M Jeon, S Venkataraman, A Phanishayee… - 2019 USENIX Annual …, 2019 - usenix.org
With widespread advances in machine learning, a number of large enterprises are
beginning to incorporate machine learning models across a number of products. These …

Fine-grained GPU sharing primitives for deep learning applications

P Yu, M Chowdhury - Proceedings of Machine Learning and …, 2020 - proceedings.mlsys.org
Unlike traditional resources such as CPU or the network, modern GPUs do not natively
support fine-grained sharing primitives. Consequently, implementing common policies such …

Online evolutionary batch size orchestration for scheduling deep learning workloads in GPU clusters

Z Bian, S Li, W Wang, Y You - … of the International Conference for High …, 2021 - dl.acm.org
Efficient GPU resource scheduling is essential to maximize resource utilization and save
training costs for the increasing amount of deep learning workloads in shared GPU clusters …

CODA: Improving resource utilization by slimming and co-locating DNN and CPU jobs

H Zhao, W Cui, Q Chen, J Leng, K Yu… - 2020 IEEE 40th …, 2020 - ieeexplore.ieee.org
While deep neural network (DNN) models are often trained on GPUs, many companies and
research institutes build GPU clusters that are shared by different groups. On such GPU …

Multi-tenant GPU clusters for deep learning workloads: Analysis and implications

M Jeon, S Venkataraman, J Qian… - Technical report …, 2018 - microsoft.com
With widespread advances in machine learning, a number of large enterprises are
beginning to incorporate machine learning models across a number of products. These …