Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures

T Gupta, S Krishnan, R Kumar, A Vijeev… - Proceedings of the …, 2024 - dl.acm.org
Deep Learning training jobs process large amounts of training data using many GPU
devices, often running for weeks or months. When hardware or software failures happen …
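The snippet breaks off before describing the paper's approach. As a loose, hypothetical illustration of the reactive idea in the title (take a checkpoint only when a failure is imminent, not on a fixed schedule), the Python sketch below registers a signal handler that persists training state on a termination signal; the function name, signal choice, and file path are assumptions, not the paper's design.

import signal
import torch

def install_jit_checkpoint(model, optimizer, path="ckpt_on_failure.pt"):
    # Hypothetical sketch: persist training state only when the job is told
    # it is about to fail, instead of checkpointing on a periodic timer.
    def handler(signum, frame):
        torch.save({"model": model.state_dict(),
                    "optim": optimizer.state_dict()}, path)
        raise SystemExit(1)
    signal.signal(signal.SIGTERM, handler)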

Transcending runtime-memory tradeoffs in checkpointing by being fusion aware

H He, S Yu - Proceedings of Machine Learning and Systems, 2023 - proceedings.mlsys.org
Gradient checkpointing is an optimization that reduces the memory footprint by re-computing
some operations instead of saving their activations. Previous works on checkpointing have …
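Since the snippet names the underlying mechanism, a minimal example of plain gradient checkpointing (via PyTorch's torch.utils.checkpoint, not the fusion-aware scheme this paper proposes) may help make it concrete: the wrapped block's activations are dropped during the forward pass and recomputed during backward.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A block whose intermediate activations we choose not to keep.
block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
x = torch.randn(32, 1024, requires_grad=True)

# Forward: activations inside `block` are discarded to save memory.
# Backward: they are recomputed on the fly, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()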

System-level vs. application-level checkpointing

J Posner - 2020 IEEE International Conference on Cluster …, 2020 - ieeexplore.ieee.org
Fault tolerance is becoming increasingly important since the probability of permanent
hardware failures increases with machine size. A typical resilience approach to fail/stop …
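The snippet is cut off before the comparison itself; as a small illustration of the application-level side only (the helper names below are assumptions), the application decides which state matters and serializes exactly that, in contrast to system-level schemes that snapshot the entire process image.

import torch

def save_app_checkpoint(model, optimizer, epoch, path="app_ckpt.pt"):
    # Application-level: only the state the program knows it needs.
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optim": optimizer.state_dict()}, path)

def load_app_checkpoint(model, optimizer, path="app_ckpt.pt"):
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["epoch"]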

AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing

B Nicolae, F Cappello - … of the 22nd international symposium on High …, 2013 - dl.acm.org
With increasing scale and complexity of supercomputing and cloud computing architectures,
faults are becoming a frequent occurrence, which makes reliability a difficult challenge …
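The abstract is truncated before the method, but the title names two ingredients that a short sketch can illustrate under stated assumptions: asynchrony (a background thread performs the write) and incrementality (only tensors that changed since the last snapshot are saved). The adaptive use of memory access patterns, which is AI-Ckpt's actual contribution, is not modeled here, and the helper below is hypothetical.

import threading
import torch

_last_snapshot = {}

def async_incremental_checkpoint(model, step, prefix="ckpt"):
    # Incremental: keep only tensors that differ from the previous snapshot.
    changed = {}
    for name, p in model.state_dict().items():
        snap = p.detach().cpu().clone()
        if name not in _last_snapshot or not torch.equal(_last_snapshot[name], snap):
            _last_snapshot[name] = snap
            changed[name] = snap
    # Asynchronous: the actual write happens off the training thread.
    writer = threading.Thread(target=torch.save,
                              args=(changed, f"{prefix}_{step}.pt"))
    writer.start()
    return writer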

Checkpoint restart support for heterogeneous HPC applications

K Parasyris, K Keller… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
As we approach the era of exa-scale computing, fault tolerance is of growing importance.
The increasing number of cores as well as the increased complexity of modern …

A study of checkpointing in large scale training of deep neural networks

E Rojas, AN Kahira, E Meneses, LB Gomez… - arXiv preprint arXiv …, 2020 - arxiv.org
Deep learning (DL) applications are increasingly being deployed on HPC systems, to
leverage the massive parallelism and computing power of those systems for DL model …

Memory optimization for deep networks

A Shah, CY Wu, J Mohan, V Chidambaram… - arXiv preprint arXiv …, 2020 - arxiv.org
Deep learning is slowly, but steadily, hitting a memory bottleneck. While the tensor
computation in top-of-the-line GPUs increased by 32x over the last five years, the total …

DeepFreeze: Towards scalable asynchronous checkpointing of deep learning models

B Nicolae, J Li, JM Wozniak, G Bosilca… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
In the age of big data, deep learning has emerged as a powerful tool to extract insight and
exploit its value, both in industry and scientific applications. One common pattern emerging …
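As with the previous entries, the abstract is cut off before the design; the sketch below is a loose, assumption-laden illustration of one scalability idea associated with asynchronous checkpointing of replicated models (each data-parallel rank persists a disjoint shard, so no single worker serializes the whole model), not DeepFreeze's actual protocol.

import threading
import torch

def shard_and_save_async(model, rank, world_size, step):
    # Hypothetical sharding rule: rank r takes every world_size-th parameter.
    items = sorted(model.state_dict().items())
    shard = {name: p.detach().cpu().clone()
             for i, (name, p) in enumerate(items) if i % world_size == rank}
    # Each rank writes its shard in the background, overlapping with training.
    writer = threading.Thread(target=torch.save,
                              args=(shard, f"ckpt_{step}_rank{rank}.pt"))
    writer.start()
    return writer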

Crash skipping: A minimal-cost framework for efficient error recovery in approximate computing environments

Y Verdeja Herms, Y Li - Proceedings of the 2019 on Great Lakes …, 2019 - dl.acm.org
We present a lightweight technique to minimize error recovery costs in approximate
computing environments. We take advantage of the key observation that if an application …
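The key observation is truncated mid-sentence, but the idea in the title (skip recovery entirely when the failed work is tolerable in an approximate setting) can be sketched with hypothetical names:

def process_tolerant(items, kernel, fallback=None):
    # In an error-tolerant region, a crashing unit of work is treated as
    # acceptable output degradation rather than a reason to roll back.
    results = []
    for item in items:
        try:
            results.append(kernel(item))
        except Exception:
            # Skip the failed item, or substitute a cheap approximation,
            # instead of restoring from a checkpoint.
            if fallback is not None:
                results.append(fallback(item))
    return results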

Exploiting Input Tensor Dynamics in Activation Checkpointing for Efficient Training on GPU

J Liao, M Li, H Yang, Q Sun, B Sun… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Larger deep learning models usually lead to higher model quality, but at the cost of an
ever-increasing GPU memory footprint. Although several tensor checkpointing techniques have …
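The snippet breaks off before the technique, but the title suggests making the checkpointing decision sensitive to the input tensors actually seen at runtime. A hedged sketch of that general idea (not the paper's policy; the threshold and names are assumptions) is to recompute activations only when the input is large enough for the memory savings to matter:

import torch
from torch.utils.checkpoint import checkpoint

def run_block(block, x, elem_threshold=1 << 20):
    if x.numel() >= elem_threshold:
        # Large input this iteration: drop activations, recompute on backward.
        return checkpoint(block, x, use_reentrant=False)
    # Small input: keep activations, since recomputation would not pay off.
    return block(x)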