Learning scheduling algorithms for data processing clusters

H Mao, M Schwarzkopf, SB Venkatakrishnan… - Proceedings of the …, 2019 - dl.acm.org
Efficiently scheduling data processing jobs on distributed compute clusters requires complex
algorithms. Current systems use simple, generalized heuristics and ignore workload …

Gandiva: Introspective cluster scheduling for deep learning

W Xiao, R Bhardwaj, R Ramjee, M Sivathanu… - … USENIX Symposium on …, 2018 - usenix.org
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific
knowledge to improve latency and efficiency of training deep learning models in a GPU …

Beyond data and model parallelism for deep neural networks

Z Jia, M Zaharia, A Aiken - Proceedings of Machine Learning …, 2019 - proceedings.mlsys.org
Existing deep learning systems commonly parallelize deep neural network (DNN) training
using data or model parallelism, but these strategies often result in suboptimal …

Cluster resource scheduling in cloud computing: literature review and research challenges

W Khallouli, J Huang - The Journal of Supercomputing, 2022 - Springer
Scheduling plays a pivotal role in cloud computing systems. Designing an efficient
scheduler is a challenging task. The challenge comes from several aspects, including the …

Tiresias: A GPU cluster manager for distributed deep learning

J Gu, M Chowdhury, KG Shin, Y Zhu, M Jeon… - … USENIX Symposium on …, 2019 - usenix.org
Deep learning (DL) training jobs bring some unique challenges to existing cluster
managers, such as unpredictable training times, an all-or-nothing execution model, and …

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu, C Guo - Proceedings of the Thirteenth …, 2018 - dl.acm.org
Deep learning workloads are common in today's production clusters due to the proliferation
of deep learning driven AI services (e.g., speech recognition, machine translation). A deep …

Optimized container scheduling for data-intensive serverless edge computing

T Rausch, A Rashed, S Dustdar - Future Generation Computer Systems, 2021 - Elsevier
Operating data-intensive applications on edge systems is challenging, due to the extreme
workload and device heterogeneity, as well as the geographic dispersion of compute and …

Serving DNNs like clockwork: Performance predictability from the bottom up

A Gujarati, R Karimi, S Alzayat, W Hao… - … USENIX Symposium on …, 2020 - usenix.org
Machine learning inference is becoming a core building block for interactive web
applications. As a result, the underlying model serving systems on which these applications …

Faster and cheaper serverless computing on harvested resources

Y Zhang, Í Goiri, GI Chaudhry, R Fonseca… - Proceedings of the …, 2021 - dl.acm.org
Serverless computing is becoming increasingly popular due to its ease of programming, fast
elasticity, and fine-grained billing. However, the serverless provider still needs to provision …

Protean: VM allocation service at scale

O Hadary, L Marshall, I Menache, A Pan… - … USENIX Symposium on …, 2020 - usenix.org
We describe the design and implementation of Protean--the Microsoft Azure service
responsible for allocating Virtual Machines (VMs) to millions of servers around the globe. A …