Kube-knots: Resource harvesting through dynamic container orchestration in GPU-based datacenters

P Thinakaran, JR Gunasekaran… - … on cluster computing …, 2019 - ieeexplore.ieee.org
Compute heterogeneity is increasingly gaining prominence in modern datacenters due to
the addition of accelerators like GPUs and FPGAs. We observe that datacenter schedulers …

The curious case of container orchestration and scheduling in GPU-based datacenters

P Thinakaran, J Raj, B Sharma, MT Kandemir… - Proceedings of the …, 2018 - dl.acm.org
Modern data centers are increasingly being provisioned with compute accelerators such as
GPUs, FPGAs and ASICs to catch up with the workload performance demands and reduce …

EMF: Disaggregated GPUs in datacenters for efficiency, modularity and flexibility

A Guleria, J Lakshmi, C Padala - 2019 IEEE International …, 2019 - ieeexplore.ieee.org
Disaggregating expensive and power-hungry GPUs enables a cost-efficient and adaptive
ecosystem for cloud deployment, particularly for emerging markets, wherein AI applications …

SchedTune: A heterogeneity-aware GPU scheduler for deep learning

H Albahar, S Dongare, Y Du, N Zhao… - 2022 22nd IEEE …, 2022 - ieeexplore.ieee.org
Modern cluster management systems, such as Kubernetes, support heterogeneous
workloads and resources. However, existing resource schedulers in these systems do not …

CODA: Improving resource utilization by slimming and co-locating DNN and CPU jobs

H Zhao, W Cui, Q Chen, J Leng, K Yu… - 2020 IEEE 40th …, 2020 - ieeexplore.ieee.org
While deep neural network (DNN) models are often trained on GPUs, many companies and
research institutes build GPU clusters that are shared by different groups. On such GPU …

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Transparent GPU sharing in container clouds for deep learning workloads

B Wu, Z Zhang, Z Bai, X Liu, X Jin - 20th USENIX Symposium on …, 2023 - usenix.org
Containers are widely used for resource management in datacenters. A common practice to
support deep learning (DL) training in container clouds is to statically bind GPUs to …

Quadd: Quantifying accelerator disaggregated datacenter efficiency

A Guleria, J Lakshmi, C Padala - 2019 IEEE 12th International …, 2019 - ieeexplore.ieee.org
In the current era of data explosion, accelerators such as GPUs facilitate data-driven
applications with requisite compute boost. Availability of GPUs in Public Cloud offerings has …

Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Characterization and prediction of deep learning workloads in large-scale GPU datacenters

Q Hu, P Sun, S Yan, Y Wen, T Zhang - Proceedings of the International …, 2021 - dl.acm.org
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services
in both the research community and industry. When operating a datacenter, optimization of …