Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling

S Jayaram Subramanya, D Arfeen, S Lin… - Proceedings of the 29th …, 2023 - dl.acm.org
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to
elastic resource-adaptive jobs. Although some recent schedulers address one aspect or …

MISO: exploiting multi-instance GPU capability on multi-tenant GPU clusters

B Li, T Patel, S Samsi, V Gadepally… - Proceedings of the 13th …, 2022 - dl.acm.org
GPU technology has been improving at an expedited pace in terms of size and performance,
empowering HPC and AI/ML researchers to advance the scientific discovery process …

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent

Q Weng, L Yang, Y Yu, W Wang, X Tang… - 2023 USENIX Annual …, 2023 - usenix.org
Large tech companies are piling up a massive number of GPUs in their server fleets to run
diverse machine learning (ML) workloads. However, these expensive devices often suffer …

Chronus: A novel deadline-aware scheduler for deep learning training jobs

W Gao, Z Ye, P Sun, Y Wen, T Zhang - … of the ACM Symposium on Cloud …, 2021 - dl.acm.org
Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner.
Job scheduling is the key to improve the training performance, resource utilization and …

AI-enabling workloads on large-scale GPU-accelerated system: Characterization, opportunities, and implications

B Li, R Arora, S Samsi, T Patel… - … Symposium on High …, 2022 - ieeexplore.ieee.org
Production high-performance computing (HPC) systems are adopting and integrating GPUs
into their design to accommodate artificial intelligence (AI), machine learning, and data …

Task placement and resource allocation for edge machine learning: a GNN-based multi-agent reinforcement learning paradigm

Y Li, X Zhang, T Zeng, J Duan, C Wu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Machine learning (ML) tasks are one of the major workloads in today's edge computing
networks. Existing edge-cloud schedulers allocate the requested amounts of resources to …

Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs

Q Hu, M Zhang, P Sun, Y Wen, T Zhang - Proceedings of the 28th ACM …, 2023 - dl.acm.org
While recent deep learning workload schedulers exhibit excellent performance, it is arduous
to deploy them in practice due to some substantial defects, including inflexible intrusive …

Toward Sustainable HPC: Carbon Footprint Estimation and Environmental Implications of HPC Systems

B Li, R Basu Roy, D Wang, S Samsi… - Proceedings of the …, 2023 - dl.acm.org
The rapid growth in demand for HPC systems has led to a rise in carbon footprint, which
requires urgent intervention. In this work, we present a comprehensive analysis of the …