Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) has flourished in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has achieved remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Elastic resource management for deep learning applications in a container cluster

Y Mao, V Sharma, W Zheng, L Cheng… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
The increasing demand for learning from massive datasets is restructuring our economy.
Effective learning, however, requires nontrivial computing resources. Most businesses utilize …
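
To make the elasticity idea concrete: a minimal autoscaling loop might watch per-worker training throughput and grow or shrink the container pool as marginal gains change. The sketch below is purely illustrative (the `Cluster` stub and the 5%/1% thresholds are invented here, not taken from the paper):

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    """Hypothetical stand-in for a container-cluster API."""
    workers: int

    def scale_to(self, n: int) -> None:
        self.workers = max(1, n)

def autoscale_step(cluster: Cluster, throughput: float, prev_throughput: float) -> None:
    """Grow the pool while extra workers still pay off; shrink on
    diminishing returns. Thresholds are made-up examples."""
    gain = (throughput - prev_throughput) / max(prev_throughput, 1e-9)
    if gain > 0.05:
        cluster.scale_to(cluster.workers + 1)   # scaling up helped: keep growing
    elif gain < 0.01:
        cluster.scale_to(cluster.workers - 1)   # flat gains: release a container
```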

Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs

Q Hu, M Zhang, P Sun, Y Wen, T Zhang - Proceedings of the 28th ACM …, 2023 - dl.acm.org
While recent deep learning workload schedulers exhibit excellent performance, they are
difficult to deploy in practice due to substantial defects, including inflexible intrusive …

ElasticFlow: An elastic serverless training platform for distributed deep learning

D Gu, Y Zhao, Y Zhong, Y Xiong, Z Han… - Proceedings of the 28th …, 2023 - dl.acm.org
This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep
learning. ElasticFlow provides a serverless interface with two distinct features: (i) users …
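
The abstract is truncated here, but the serverless framing suggests an interface where users declare what a job needs rather than how many GPUs to use. A toy sketch of such an interface follows, with an invented deadline-based admission check; this is not ElasticFlow's actual API:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class TrainingJob:
    deadline_s: float                                  # assumed user-facing knob
    name: str = field(compare=False)
    remaining_work: float = field(compare=False)       # GPU-seconds still needed

class ServerlessTrainer:
    """Toy serverless front end: the platform, not the user, decides
    how many GPUs each job receives."""
    def __init__(self, total_gpus: int):
        self.total_gpus = total_gpus
        self.queue: list[TrainingJob] = []

    def submit(self, job: TrainingJob) -> bool:
        # Invented admission rule: admit only if the average GPU demand
        # of all queued jobs still fits within the cluster.
        demand = sum(j.remaining_work / j.deadline_s for j in self.queue)
        if demand + job.remaining_work / job.deadline_s > self.total_gpus:
            return False
        heapq.heappush(self.queue, job)
        return True
```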

Astraea: A fair deep learning scheduler for multi-tenant GPU clusters

Z Ye, P Sun, W Gao, T Zhang, X Wang… - … on Parallel and …, 2021 - ieeexplore.ieee.org
Modern GPU clusters are designed to support distributed deep learning jobs from multiple
tenants concurrently. Each tenant may have varied and dynamic resource demands …
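
As a rough illustration of multi-tenant fairness (not Astraea's actual policy), a max-min-style scheduler can simply serve the tenant with the smallest weighted accumulated GPU time:

```python
def pick_next_tenant(usage: dict[str, float], weights: dict[str, float]) -> str:
    """Serve the tenant whose weighted accumulated GPU-time is smallest.
    A max-min fairness sketch; weights model per-tenant entitlements."""
    return min(usage, key=lambda t: usage[t] / weights.get(t, 1.0))

# Example: tenant B has consumed the least relative to its weight.
usage = {"A": 120.0, "B": 40.0, "C": 90.0}   # GPU-hours consumed so far
weights = {"A": 1.0, "B": 1.0, "C": 2.0}     # entitlement weights
assert pick_next_tenant(usage, weights) == "B"
```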

Hydra: Deadline-aware and efficiency-oriented scheduling for deep learning jobs on heterogeneous GPUs

Z Yang, H Wu, Y Xu, Y Wu, H Zhong… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
With the rapid proliferation of deep learning (DL) jobs running on heterogeneous GPUs,
scheduling DL jobs to satisfy various requirements, such as meeting deadlines …
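
One way to picture deadline-aware scheduling on heterogeneous GPUs is earliest-deadline-first placement with per-GPU-type speed factors. The greedy rule below (take the slowest GPU that still meets the deadline, saving fast GPUs for tighter jobs) is an invented illustration, not Hydra's algorithm:

```python
def place_jobs(jobs, gpu_pools):
    """jobs: list of (deadline_s, work_gpu_s); gpu_pools: {type: [speedup, free]}.
    Returns one (deadline, work, gpu_type_or_None) per job, in EDF order."""
    plan = []
    for deadline, work in sorted(jobs):                      # earliest deadline first
        fits = [(s, t) for t, (s, n) in gpu_pools.items()
                if n > 0 and work / s <= deadline]
        if not fits:
            plan.append((deadline, work, None))              # unavoidable miss
            continue
        speed, gtype = min(fits)                             # slowest GPU that suffices
        gpu_pools[gtype][1] -= 1
        plan.append((deadline, work, gtype))
    return plan

# Example: the tight job gets the fast GPU, the loose job the slow one.
pools = {"a100": [2.0, 1], "t4": [1.0, 1]}
print(place_jobs([(100.0, 150.0), (400.0, 200.0)], pools))
```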

Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters

Z Mo, H Xu, C Xu - Proceedings of the 29th ACM International …, 2024 - dl.acm.org
Modern GPU clusters are inherently heterogeneous in aspects such as computation and
communication. This heterogeneity poses a significant challenge for the …
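
A concrete symptom of such heterogeneity: with a uniform per-worker batch size, fast GPUs idle while slow ones finish a step. A common remedy, sketched here as a generic technique rather than Heet's method, is to split the global batch in proportion to measured per-worker throughput:

```python
def split_batch(global_batch: int, throughput: dict[str, float]) -> dict[str, int]:
    """Assign each worker a share of the global batch proportional to its
    measured samples/sec so all workers finish a step at about the same time."""
    total = sum(throughput.values())
    shares = {w: int(global_batch * t / total) for w, t in throughput.items()}
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += global_batch - sum(shares.values())   # rounding leftovers
    return shares

# Example: a GPU measured at twice the throughput gets twice the samples.
print(split_batch(512, {"v100-0": 900.0, "t4-0": 450.0}))
# -> {'v100-0': 342, 't4-0': 170}
```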

AutoSched: An Adaptive Self-configured Framework for Scheduling Deep Learning Training Workloads

W Gao, X Zhang, S Huang, S Guo, P Sun… - Proceedings of the 38th …, 2024 - dl.acm.org
Modern Deep Learning Training (DLT) schedulers in GPU datacenters are highly
sophisticated, with many configurations. These configurations need to be adjusted …
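
The adaptation problem can be pictured as a search over scheduler knobs scored by a trace simulator. The random-search loop and both knobs below are invented for illustration; AutoSched's actual framework is more sophisticated:

```python
import random

def tune_scheduler(simulate, budget: int = 50, seed: int = 0):
    """Random search over two hypothetical scheduler knobs; `simulate(cfg)`
    replays a workload trace and returns average job completion time."""
    rng = random.Random(seed)
    best_cfg, best_jct = None, float("inf")
    for _ in range(budget):
        cfg = {
            "quantum_s": rng.choice([30, 60, 120, 300]),   # preemption quantum
            "packing_threshold": rng.uniform(0.5, 0.95),   # GPU-sharing cutoff
        }
        jct = simulate(cfg)
        if jct < best_jct:
            best_cfg, best_jct = cfg, jct
    return best_cfg, best_jct
```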

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

F Liang, Z Zhang, H Lu, V Leung, Y Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rapid growth of datasets, models, and devices in deep learning, large-scale
distributed deep learning is attracting increasing attention. In contrast to …
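
One classic family of techniques such surveys cover is gradient compression. Below is a minimal top-k sparsification sketch (error feedback and index encoding, which practical systems add, are omitted):

```python
import numpy as np

def topk_sparsify(grad: np.ndarray, ratio: float = 0.01):
    """Keep only the largest-magnitude fraction `ratio` of gradient entries,
    sending (indices, values) instead of the dense tensor."""
    k = max(1, int(grad.size * ratio))
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def densify(idx: np.ndarray, values: np.ndarray, shape) -> np.ndarray:
    """Receiver side: rebuild a dense gradient from the sparse message."""
    out = np.zeros(int(np.prod(shape)))
    out[idx] = values
    return out.reshape(shape)
```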