Towards distributed machine learning in shared clusters: A dynamically-partitioned approach

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org

Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Cited by 4 · All 4 versions

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Cited by 23 · All 3 versions

Tiresias: A GPU cluster manager for distributed deep learning

J Gu, M Chowdhury, KG Shin, Y Zhu, M Jeon… - … USENIX Symposium on …, 2019 - usenix.org

Deep learning (DL) training jobs bring some unique challenges to existing cluster
managers, such as unpredictable training times, an all-or-nothing execution model, and …

Cited by 377 · All 13 versions

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu, C Guo - Proceedings of the Thirteenth …, 2018 - dl.acm.org

Deep learning workloads are common in today's production clusters due to the proliferation
of deep learning driven AI services (e.g., speech recognition, machine translation). A deep …

Cited by 466 · All 3 versions

Deep learning-based job placement in distributed machine learning clusters

Y Bao, Y Peng, C Wu - IEEE INFOCOM 2019-IEEE conference …, 2019 - ieeexplore.ieee.org

Production machine learning (ML) clusters commonly host a variety of distributed ML
workloads, e.g., speech recognition, machine translation. While server sharing among jobs …

Cited by 139 · All 6 versions

Online job scheduling in distributed machine learning clusters

Y Bao, Y Peng, C Wu, Z Li - IEEE INFOCOM 2018-IEEE …, 2018 - ieeexplore.ieee.org

Nowadays large-scale distributed machine learning systems have been deployed to support
various analytics and intelligence services in IT firms. To train a large dataset and derive the …

Cited by 123 · All 9 versions

DL2: A deep learning-driven scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

Efficient resource scheduling is essential for maximal utilization of expensive deep learning
(DL) clusters. Existing cluster schedulers either are agnostic to machine learning (ML) …

Cited by 81 · All 6 versions

Elastic parameter server load distribution in deep learning clusters

Y Chen, Y Peng, Y Bao, C Wu, Y Zhu… - Proceedings of the 11th …, 2020 - dl.acm.org

In distributed DNN training, parameter servers (PS) can become performance bottlenecks
due to PS stragglers, caused by imbalanced parameter distribution, bandwidth contention …

Cited by 36 · All 3 versions

Deep learning-based job placement in distributed machine learning clusters with heterogeneous workloads

Y Bao, Y Peng, C Wu - IEEE/ACM Transactions on Networking, 2022 - ieeexplore.ieee.org

Nowadays, most leading IT companies host a variety of distributed machine learning (ML)
workloads in ML clusters to support AI-driven services, such as speech recognition, machine …

Cited by 12 · All 6 versions

Gadget: Online resource optimization for scheduling ring-all-reduce learning jobs

M Yu, Y Tian, B Ji, C Wu, H Rajan… - IEEE INFOCOM 2022 …, 2022 - ieeexplore.ieee.org

Fueled by advances in distributed deep learning (DDL), recent years have witnessed a
rapidly growing demand for resource-intensive distributed/parallel computing to process …

Cited by 18 · All 11 versions
