We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific knowledge to improve latency and efficiency of training deep learning models in a GPU …
Z Jia, M Zaharia, A Aiken - Proceedings of Machine Learning …, 2019 - proceedings.mlsys.org
Existing deep learning systems commonly parallelize deep neural network (DNN) training using data or model parallelism, but these strategies often result in suboptimal …
Scheduling plays a pivotal role in cloud computing systems. Designing an efficient scheduler is a challenging task. The challenge comes from several aspects, including the …
Deep learning (DL) training jobs bring some unique challenges to existing cluster managers, such as unpredictable training times, an all-or-nothing execution model, and …
Deep learning workloads are common in today's production clusters due to the proliferation of deep learning driven AI services (e.g., speech recognition, machine translation). A deep …
T Rausch, A Rashed, S Dustdar - Future Generation Computer Systems, 2021 - Elsevier
Operating data-intensive applications on edge systems is challenging, due to the extreme workload and device heterogeneity, as well as the geographic dispersion of compute and …
Machine learning inference is becoming a core building block for interactive web applications. As a result, the underlying model serving systems on which these applications …
Serverless computing is becoming increasingly popular due to its ease of programming, fast elasticity, and fine-grained billing. However, the serverless provider still needs to provision …
We describe the design and implementation of Protean, the Microsoft Azure service responsible for allocating Virtual Machines (VMs) to millions of servers around the globe. A …