A survey of resource-efficient LLM and multimodal foundation models

M Xu, W Yin, D Cai, R Yi, D Xu, Q Wang, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion models, and LLM-based multimodal models, are revolutionizing the entire machine …

Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

J Duan, Z Song, X Miao, X Xi, D Lin, H Xu… - … USENIX Symposium on …, 2024 - usenix.org
Deep neural networks (DNNs) are becoming progressively large and costly to train. This
paper aims to reduce DNN training costs by leveraging preemptible instances on modern …

Unicron: Economizing self-healing LLM training at scale

T He, X Li, Z Wang, K Qian, J Xu, W Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Training large-scale language models is increasingly critical in various domains, but it is
hindered by frequent failures, leading to significant time and economic costs. Current failure …

Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures

T Gupta, S Krishnan, R Kumar, A Vijeev… - Proceedings of the …, 2024 - dl.acm.org
Deep Learning training jobs process large amounts of training data using many GPU
devices, often running for weeks or months. When hardware or software failures happen …

SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures

S Gandhi, M Zhao, A Skiadopoulos… - arXiv preprint arXiv …, 2024 - arxiv.org
Training large Deep Neural Network (DNN) models requires thousands of GPUs for days or
weeks at a time. At these scales, failures are frequent and can have a big impact on training …

Token-wise Influential Training Data Retrieval for Large Language Models

H Lin, J Long, Z Xu, W Zhao - arXiv preprint arXiv:2405.11724, 2024 - arxiv.org
Given a Large Language Model (LLM) generation, how can we identify which training data
led to this generation? In this paper, we propose RapidIn, a scalable framework adapting to …

TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections

M Wagenländer, G Li, B Zhao, L Mai… - arXiv preprint arXiv …, 2023 - arxiv.org
Deep learning (DL) jobs use multi-dimensional parallelism, i.e., they combine data, model,
and pipeline parallelism, to use large GPU clusters efficiently. This couples jobs tightly to a …
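
The coupling noted in this snippet follows directly from the degree arithmetic of multi-dimensional parallelism: a cluster of N GPUs must be factored into data-, tensor/model-, and pipeline-parallel degrees with dp × tp × pp = N. The minimal Python sketch below is a hypothetical illustration of that constraint (not Tenplex's actual API or mechanism): it enumerates the valid factorizations for a given GPU count, and changing the cluster size changes the set of valid splits, which is why job state must be re-partitioned when resources change.

```python
# Hypothetical sketch: multi-dimensional parallelism splits N GPUs into
# data- (dp), tensor/model- (tp), and pipeline-parallel (pp) degrees so that
# dp * tp * pp == N. Resizing the cluster changes which splits are valid.

def valid_configs(n_gpus: int):
    """Enumerate all (dp, tp, pp) factorizations of the cluster size."""
    return [
        (dp, tp, n_gpus // (dp * tp))
        for dp in range(1, n_gpus + 1)
        for tp in range(1, n_gpus // dp + 1)
        if n_gpus % (dp * tp) == 0
    ]

print(valid_configs(8))  # includes (2, 2, 2), (4, 2, 1), (8, 1, 1), ...
print(valid_configs(6))  # shrinking from 8 to 6 GPUs invalidates every 8-GPU split
```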

PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation

Z Huang, X Wei, Y Hao, R Chen, M Han, J Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is an OS-level
GPU C/R system: It can transparently checkpoint or restore processes that use the GPU …

DeepVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud

Y Kim, K Kim, Y Cho, J Kim, A Khan, KD Kang… - arXiv preprint arXiv …, 2024 - arxiv.org
Distributed Deep Learning (DDL), as a paradigm, dictates the use of GPU-based clusters as
the optimal infrastructure for training large-scale Deep Neural Networks (DNNs). However …
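
As a rough illustration of the trade-off this entry targets, the back-of-the-envelope cost model below is a hypothetical sketch (not DeepVM's actual placement algorithm; the prices, preemption rate, and checkpoint interval are made-up parameters). It compares on-demand and spot training costs when each preemption wastes, on average, half a checkpoint interval of work.

```python
# Hypothetical cost model: spot VMs are cheaper per hour but preemptions waste
# the work done since the last checkpoint, so the effective cost depends on the
# preemption rate and the checkpoint interval.

def expected_cost(train_hours, price_per_hour, preempt_per_hour=0.0,
                  ckpt_interval_hours=1.0):
    # On average, half a checkpoint interval of work is lost per preemption.
    wasted = preempt_per_hour * train_hours * (ckpt_interval_hours / 2)
    return (train_hours + wasted) * price_per_hour

on_demand = expected_cost(100, price_per_hour=3.0)
spot = expected_cost(100, price_per_hour=1.0, preempt_per_hour=0.2,
                     ckpt_interval_hours=1.0)
print(on_demand, spot)  # 300.0 vs 110.0: spot wins despite the redone work
```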

Building efficient and practical machine learning systems

Q Hu - 2023 - dr.ntu.edu.sg
With the widespread adoption of deep learning (DL) applications in recent years, training DL
models has become increasingly prevalent. Nevertheless, training these models is typically …