Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has seen immense success in the recent past, leading to state-of-the-art results in various domains, such as image recognition and natural language processing …

Machine learning methods for reliable resource provisioning in edge-cloud computing: A survey

TL Duc, RG Leiva, P Casari, PO Östberg - ACM Computing Surveys …, 2019 - dl.acm.org
Large-scale software systems are currently designed as distributed entities and deployed in
cloud data centers. To overcome the limitations inherent to this type of deployment …

MLaaS in the wild: Workload analysis and scheduling in large-scale heterogeneous GPU clusters

Q Weng, W Xiao, Y Yu, W Wang, C Wang, J He… - … USENIX Symposium on …, 2022 - usenix.org
With sustained technological advances in machine learning (ML) and the recent availability of massive datasets, tech companies are deploying large ML-as-a-Service (MLaaS) …

A learning-based incentive mechanism for federated learning

Y Zhan, P Li, Z Qu, D Zeng… - IEEE Internet of Things …, 2020 - ieeexplore.ieee.org
The Internet of Things (IoT) generates large amounts of data at the network edge. Machine learning models are often built on these data to enable the detection, classification, and …

Gandiva: Introspective cluster scheduling for deep learning

W Xiao, R Bhardwaj, R Ramjee, M Sivathanu… - … USENIX Symposium on …, 2018 - usenix.org
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific knowledge to improve the latency and efficiency of training deep learning models in a GPU …

Tiresias: A GPU cluster manager for distributed deep learning

J Gu, M Chowdhury, KG Shin, Y Zhu, M Jeon… - … USENIX Symposium on …, 2019 - usenix.org
Deep learning (DL) training jobs bring some unique challenges to existing cluster
managers, such as unpredictable training times, an all-or-nothing execution model, and …

Analysis of large-scale multi-tenant GPU clusters for DNN training workloads

M Jeon, S Venkataraman, A Phanishayee… - 2019 USENIX Annual …, 2019 - usenix.org
With widespread advances in machine learning, many large enterprises are beginning to incorporate machine learning models across a range of products. These …

A generic communication scheduler for distributed DNN training acceleration

Y Peng, Y Zhu, Y Chen, Y Bao, B Yi, C Lan… - Proceedings of the 27th …, 2019 - dl.acm.org
We present ByteScheduler, a generic communication scheduler for distributed DNN training
acceleration. ByteScheduler is based on our principled analysis that partitioning and …

MArk: Exploiting cloud services for cost-effective, SLO-aware machine learning inference serving

C Zhang, M Yu, W Wang, F Yan - 2019 USENIX Annual Technical …, 2019 - usenix.org
Advances in Machine Learning (ML) have sparked a growing demand for ML-as-a-Service: developers train ML models and publish them in the cloud as online services to …

Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning

A Qiao, SK Choe, SJ Subramanya… - … on Operating Systems …, 2021 - usenix.org
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-
optimizing inter-dependent factors both at the per-job level and at the cluster-wide level …