Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) has flourished in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Oobleck: Resilient distributed training of large models using pipeline templates

I Jang, Z Yang, Z Zhang, X Jin… - Proceedings of the 29th …, 2023 - dl.acm.org
Oobleck enables resilient distributed training of large DNN models with guaranteed fault
tolerance. It takes a planning-execution co-design approach, where it first generates a set of …
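
The truncated sentence refers to pipeline templates: plans precomputed before training so that, when nodes fail, pipelines can be re-instantiated from the surviving nodes instead of restarting the job. A minimal sketch of that re-instantiation step, assuming hypothetical names (PipelineTemplate, reinstantiate) rather than the paper's API:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PipelineTemplate:
        """A precomputed pipeline plan: the nodes it needs and the number
        of stages the model is split into across them."""
        num_nodes: int
        num_stages: int

    def reinstantiate(templates: list[PipelineTemplate],
                      alive_nodes: int) -> list[PipelineTemplate]:
        """After a failure, greedily cover the surviving nodes with the
        largest templates that still fit, so training resumes without a
        cold restart. Illustrative only; the real planner is more subtle."""
        chosen, remaining = [], alive_nodes
        for t in sorted(templates, key=lambda t: t.num_nodes, reverse=True):
            while t.num_nodes <= remaining:
                chosen.append(t)
                remaining -= t.num_nodes
        return chosen

    templates = [PipelineTemplate(4, 4), PipelineTemplate(3, 3), PipelineTemplate(2, 2)]
    print(reinstantiate(templates, 7))  # one 4-node and one 3-node pipeline fit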

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

ElasticFlow: An elastic serverless training platform for distributed deep learning

D Gu, Y Zhao, Y Zhong, Y Xiong, Z Han… - Proceedings of the 28th …, 2023 - dl.acm.org
This paper proposes ElasticFlow, an elastic serverless training platform for distributed deep
learning. ElasticFlow provides a serverless interface with two distinct features: (i) users …
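
The feature list is cut off; in the paper, users specify the job and a completion deadline rather than a number of GPUs, and the platform allocates resources to meet the deadline. A toy admission check under those assumptions, with a made-up throughput model and hypothetical function names:

    def throughput(num_gpus: int, scaling_efficiency: float = 0.9) -> float:
        """Samples/sec on num_gpus with sublinear scaling -- a stand-in for
        the profiled throughput curve a real scheduler would use."""
        return num_gpus * scaling_efficiency ** (num_gpus - 1)

    def min_gpus_to_meet_deadline(remaining_samples: float,
                                  deadline_secs: float,
                                  max_gpus: int) -> int | None:
        """Smallest GPU count that finishes in time, or None if even
        max_gpus cannot meet the deadline (the job would be rejected)."""
        for g in range(1, max_gpus + 1):
            if remaining_samples / throughput(g) <= deadline_secs:
                return g
        return None

    # A job with 30,000 samples left and a 3-hour deadline on a 16-GPU pool:
    print(min_gpus_to_meet_deadline(30_000, 3 * 3600, 16))  # -> 4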

Astraea: A fair deep learning scheduler for multi-tenant GPU clusters

Z Ye, P Sun, W Gao, T Zhang, X Wang… - … on Parallel and …, 2021 - ieeexplore.ieee.org
Modern GPU clusters are designed to support distributed Deep Learning jobs from multiple
tenants concurrently. Each tenant may have varied and dynamic resource demands …
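
One way to act on such dynamic demands is deficit-based fair sharing: track each tenant's received GPU time against its weighted share and serve the most underserved tenant next. The sketch below illustrates that general idea; the metric and names are hypothetical, not Astraea's actual fairness definition:

    def pick_tenant(usage: dict[str, float], weight: dict[str, float]) -> str:
        """Serve the tenant whose received GPU time lags furthest behind its
        weighted share -- one common notion of long-term fairness."""
        total = sum(usage.values()) or 1.0
        total_w = sum(weight.values())
        def deficit(t: str) -> float:
            fair_share = weight[t] / total_w      # fraction this tenant deserves
            return fair_share - usage[t] / total  # positive => underserved
        return max(usage, key=deficit)

    usage  = {"tenantA": 120.0, "tenantB": 40.0, "tenantC": 40.0}  # GPU-hours so far
    weight = {"tenantA": 1.0,   "tenantB": 1.0,   "tenantC": 2.0}
    print(pick_tenant(usage, weight))  # tenantC: highest weight, least service per weight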

Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters

Z Mo, H Xu, C Xu - Proceedings of the 29th ACM International …, 2024 - dl.acm.org
Modern GPU clusters inherently exhibit heterogeneity, encompassing various aspects such
as computation and communication. This heterogeneity poses a significant challenge for the …
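
A concrete instance of the computation-heterogeneity problem: if workers run at different speeds, a uniform batch split leaves fast GPUs idle at every synchronization barrier. A generic throughput-proportional split, shown only to illustrate the problem Heet targets (not its algorithm):

    def balance_batches(global_batch: int,
                        throughputs: dict[str, float]) -> dict[str, int]:
        """Split a global batch across heterogeneous workers in proportion to
        measured throughput, so every worker finishes a step at about the
        same time."""
        total = sum(throughputs.values())
        shares = {w: round(global_batch * tp / total) for w, tp in throughputs.items()}
        # Repair rounding drift so the shares still sum to the global batch.
        shares[max(throughputs, key=throughputs.get)] += global_batch - sum(shares.values())
        return shares

    # e.g. two GPUs measured ~3x faster than a third on this model:
    print(balance_batches(256, {"fast-0": 300.0, "fast-1": 300.0, "slow-0": 100.0}))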

Unicron: Economizing self-healing LLM training at scale

T He, X Li, Z Wang, K Qian, J Xu, W Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Training large-scale language models is increasingly critical in various domains, but it is
hindered by frequent failures, leading to significant time and economic costs. Current failure …
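
The "economizing" angle can be illustrated with a toy cost model: the cheapest recovery action minimizes downtime plus the time to finish the remaining work at the post-recovery throughput, not downtime alone. All numbers and strategy names below are hypothetical:

    def total_cost(downtime: float, throughput_factor: float,
                   remaining_steps: int, step_time: float) -> float:
        """Recovery downtime plus the time to finish the remaining steps at
        the post-recovery throughput. Toy model; all numbers are made up."""
        return downtime + remaining_steps * step_time / throughput_factor

    # Hypothetical options after one node of sixteen fails:
    options = {
        "wait_for_replacement": total_cost(900.0, 1.0,     10_000, 2.0),
        "continue_on_15_nodes": total_cost(120.0, 15 / 16, 10_000, 2.0),
    }
    print(min(options, key=options.get))  # waiting wins here despite longer downtime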

DeepBoot: Dynamic Scheduling System for Training and Inference Deep Learning Tasks in GPU Cluster

Z Chen, X Zhao, C Zhi, J Yin - IEEE Transactions on Parallel …, 2023 - ieeexplore.ieee.org
Deep learning tasks (DLT) include training and inference tasks, where training DLTs aim to
minimize average job completion time (JCT) and inference tasks need …
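
A common way to co-schedule the two task types is to lend idle inference GPUs to training jobs and reclaim them when inference load rises. A hypothetical sketch of that lend/reclaim policy (not DeepBoot's actual interface):

    class GpuLender:
        """Lend idle inference GPUs to training jobs and reclaim them when
        inference load rises. Hypothetical sketch."""
        def __init__(self, inference_gpus: int):
            self.total = inference_gpus

        def rebalance(self, inference_load: float) -> int:
            """Reserve enough GPUs for the current inference load plus one
            spare; return how many GPUs training may borrow right now."""
            reserved = min(self.total, int(inference_load) + 1)
            return self.total - reserved

    lender = GpuLender(inference_gpus=8)
    print(lender.rebalance(inference_load=2.3))  # low load: 5 GPUs lent to training
    print(lender.rebalance(inference_load=6.8))  # load spike: only 1 GPU lent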

Transom: An efficient fault-tolerant system for training LLMs

B Wu, L Xia, Q Li, K Li, X Chen, Y Guo, T Xiang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs), represented by ChatGPT, have achieved profound
applications and breakthroughs in various fields. This demonstrates that LLMs with …
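
A building block of such fault-tolerant training systems is asynchronous checkpointing: pause training only for an in-memory snapshot and overlap the slow disk write with subsequent steps. A generic sketch in that spirit, not Transom's actual checkpoint engine:

    import copy
    import pickle
    import threading

    def async_checkpoint(state: dict, path: str) -> threading.Thread:
        """Snapshot training state, then persist it on a background thread so
        the training loop blocks only for the in-memory copy, not the disk
        write."""
        snapshot = copy.deepcopy(state)   # brief pause on the training path
        def _write():
            with open(path, "wb") as f:
                pickle.dump(snapshot, f)  # slow I/O overlaps with training
        t = threading.Thread(target=_write, daemon=True)
        t.start()
        return t

    state = {"step": 1200, "weights": [0.1] * 4}
    writer = async_checkpoint(state, "ckpt-1200.pkl")
    # ...training would continue here while the checkpoint is written...
    writer.join()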

A Comprehensive Study of Deep Learning and Performance Comparison of Deep Neural Network Models (YOLO, RetinaNet)

NI Nife, M Chtourou - International Journal of Online & …, 2023 - search.ebscohost.com
This paper presents the latest advances in machine learning techniques and highlights
deep learning (DL) methods in recent studies. This technology has recently received great …