Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling

S Jayaram Subramanya, D Arfeen, S Lin… - Proceedings of the 29th …, 2023 - dl.acm.org
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to
elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …

MAST: Global scheduling of ML training across Geo-Distributed datacenters at hyperscale

A Choudhury, Y Wang, T Pelkonen… - 18th USENIX …, 2024 - yangwang83.github.io
In public clouds, users must manually select a datacenter region to upload their ML training
data and launch ML training workloads in the same region to ensure data and computation …

Artificial Intelligence's new clothes? A system technology perspective

S Vannuccini, E Prytkova - Journal of Information Technology, 2024 - journals.sagepub.com
In this paper, we offer an original framework to study Artificial Intelligence (AI). The
perspective we propose is based on the idea that AI is a system technology, and that a …

Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent

Q Weng, L Yang, Y Yu, W Wang, X Tang… - 2023 USENIX Annual …, 2023 - usenix.org
Large tech companies are piling up a massive number of GPUs in their server fleets to run
diverse machine learning (ML) workloads. However, these expensive devices often suffer …

Lyra: Elastic scheduling for deep learning clusters

J Li, H Xu, Y Zhu, Z Liu, C Guo, C Wang - Proceedings of the Eighteenth …, 2023 - dl.acm.org
Organizations often build separate training and inference clusters for deep learning, and use
separate schedulers to manage them. This leads to problems for both: inference clusters …

Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs

Q Hu, M Zhang, P Sun, Y Wen, T Zhang - Proceedings of the 28th ACM …, 2023 - dl.acm.org
While recent deep learning workload schedulers exhibit excellent performance, it is arduous
to deploy them in practice due to some substantial defects, including inflexible intrusive …

Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections

M Wagenländer, G Li, B Zhao, L Mai… - Proceedings of the ACM …, 2024 - dl.acm.org
Deep learning (DL) jobs use multi-dimensional parallelism, i.e., combining data, model, and
pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may …

SiloD: A co-design of caching and scheduling for deep learning clusters

H Zhao, Z Han, Z Yang, Q Zhang, M Li, F Yang… - Proceedings of the …, 2023 - dl.acm.org
Deep learning training on cloud platforms usually follows the tradition of the separation of
storage and computing. The training executes on a compute cluster equipped with …