Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures

T Gupta, S Krishnan, R Kumar, A Vijeev… - Proceedings of the …, 2024 - dl.acm.org
Deep Learning training jobs process large amounts of training data using many GPU
devices, often running for weeks or months. When hardware or software failures happen …
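The snippet breaks off before describing the paper's approach. As a loose, hypothetical illustration of the reactive idea in the title (take a checkpoint only when a failure is imminent, not on a fixed schedule), the Python sketch below registers a signal handler that persists training state on a termination signal; the function name, signal choice, and file path are assumptions, not the paper's design.

import signal
import torch

def install_jit_checkpoint(model, optimizer, path="ckpt_on_failure.pt"):
    # Hypothetical sketch: persist training state only when the job is told
    # it is about to fail, instead of checkpointing on a periodic timer.
    def handler(signum, frame):
        torch.save({"model": model.state_dict(),
                    "optim": optimizer.state_dict()}, path)
        raise SystemExit(1)
    signal.signal(signal.SIGTERM, handler)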

Transcending runtime-memory tradeoffs in checkpointing by being fusion aware

H He, S Yu - Proceedings of Machine Learning and Systems, 2023 - proceedings.mlsys.org
Gradient checkpointing is an optimization that reduces the memory footprint by re-computing
some operations instead of saving their activations. Previous works on checkpointing have …
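Since the snippet names the underlying mechanism, a minimal example of plain gradient checkpointing (via PyTorch's torch.utils.checkpoint, not the fusion-aware scheme this paper proposes) may help make it concrete: the wrapped block's activations are dropped during the forward pass and recomputed during backward.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A block whose intermediate activations we choose not to keep.
block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
x = torch.randn(32, 1024, requires_grad=True)

# Forward: activations inside `block` are discarded to save memory.
# Backward: they are recomputed on the fly, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()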

System-level vs. application-level checkpointing

J Posner - 2020 IEEE International Conference on Cluster …, 2020 - ieeexplore.ieee.org
Fault tolerance is becoming increasingly important since the probability of permanent
hardware failures increases with machine size. A typical resilience approach to fail/stop …
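The snippet is cut off before the comparison itself; as a small illustration of the application-level side only (the helper names below are assumptions), the application decides which state matters and serializes exactly that, in contrast to system-level schemes that snapshot the entire process image.

import torch

def save_app_checkpoint(model, optimizer, epoch, path="app_ckpt.pt"):
    # Application-level: only the state the program knows it needs.
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optim": optimizer.state_dict()}, path)

def load_app_checkpoint(model, optimizer, path="app_ckpt.pt"):
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["epoch"]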

AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing

B Nicolae, F Cappello - … of the 22nd international symposium on High …, 2013 - dl.acm.org
With increasing scale and complexity of supercomputing and cloud computing architectures,
faults are becoming a frequent occurrence, which makes reliability a difficult challenge …
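The abstract is truncated before the method, but the title names two ingredients that a short sketch can illustrate under stated assumptions: asynchrony (a background thread performs the write) and incrementality (only tensors that changed since the last snapshot are saved). The adaptive use of memory access patterns, which is AI-Ckpt's actual contribution, is not modeled here, and the helper below is hypothetical.

import threading
import torch

_last_snapshot = {}

def async_incremental_checkpoint(model, step, prefix="ckpt"):
    # Incremental: keep only tensors that differ from the previous snapshot.
    changed = {}
    for name, p in model.state_dict().items():
        snap = p.detach().cpu().clone()
        if name not in _last_snapshot or not torch.equal(_last_snapshot[name], snap):
            _last_snapshot[name] = snap
            changed[name] = snap
    # Asynchronous: the actual write happens off the training thread.
    writer = threading.Thread(target=torch.save,
                              args=(changed, f"{prefix}_{step}.pt"))
    writer.start()
    return writer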

Checkpoint restart support for heterogeneous HPC applications

K Parasyris, K Keller… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
As we approach the era of exa-scale computing, fault tolerance is of growing importance.
The increasing number of cores as well as the increased complexity of modern …

A study of checkpointing in large scale training of deep neural networks

E Rojas, AN Kahira, E Meneses, LB Gomez… - arXiv preprint arXiv …, 2020 - arxiv.org
Deep learning (DL) applications are increasingly being deployed on HPC systems, to
leverage the massive parallelism and computing power of those systems for DL model …

Memory optimization for deep networks

A Shah, CY Wu, J Mohan, V Chidambaram… - arXiv preprint arXiv …, 2020 - arxiv.org
Deep learning is slowly, but steadily, hitting a memory bottleneck. While the tensor
computation in top-of-the-line GPUs increased by 32x over the last five years, the total …

DeepFreeze: Towards scalable asynchronous checkpointing of deep learning models

B Nicolae, J Li, JM Wozniak, G Bosilca… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
In the age of big data, deep learning has emerged as a powerful tool to extract insight and
exploit its value, both in industry and scientific applications. One common pattern emerging …
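As with the previous entries, the abstract is cut off before the design; the sketch below is a loose, assumption-laden illustration of one scalability idea associated with asynchronous checkpointing of replicated models (each data-parallel rank persists a disjoint shard, so no single worker serializes the whole model), not DeepFreeze's actual protocol.

import threading
import torch

def shard_and_save_async(model, rank, world_size, step):
    # Hypothetical sharding rule: rank r takes every world_size-th parameter.
    items = sorted(model.state_dict().items())
    shard = {name: p.detach().cpu().clone()
             for i, (name, p) in enumerate(items) if i % world_size == rank}
    # Each rank writes its shard in the background, overlapping with training.
    writer = threading.Thread(target=torch.save,
                              args=(shard, f"ckpt_{step}_rank{rank}.pt"))
    writer.start()
    return writer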

Crash skipping: A minimal-cost framework for efficient error recovery in approximate computing environments

Y Verdeja Herms, Y Li - Proceedings of the 2019 on Great Lakes …, 2019 - dl.acm.org
We present a lightweight technique to minimize error recovery costs in approximate
computing environments. We take advantage of the key observation that if an application …
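The key observation is truncated mid-sentence, but the idea in the title (skip recovery entirely when the failed work is tolerable in an approximate setting) can be sketched with hypothetical names:

def process_tolerant(items, kernel, fallback=None):
    # In an error-tolerant region, a crashing unit of work is treated as
    # acceptable output degradation rather than a reason to roll back.
    results = []
    for item in items:
        try:
            results.append(kernel(item))
        except Exception:
            # Skip the failed item, or substitute a cheap approximation,
            # instead of restoring from a checkpoint.
            if fallback is not None:
                results.append(fallback(item))
    return results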

Exploiting Input Tensor Dynamics in Activation Checkpointing for Efficient Training on GPU

J Liao, M Li, H Yang, Q Sun, B Sun… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Larger deep learning models usually lead to higher model quality, but at the cost of an
ever-increasing GPU memory footprint. Although several tensor checkpointing techniques have …
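The snippet breaks off before the technique, but the title suggests making the checkpointing decision sensitive to the input tensors actually seen at runtime. A hedged sketch of that general idea (not the paper's policy; the threshold and names are assumptions) is to recompute activations only when the input is large enough for the memory savings to matter:

import torch
from torch.utils.checkpoint import checkpoint

def run_block(block, x, elem_threshold=1 << 20):
    if x.numel() >= elem_threshold:
        # Large input this iteration: drop activations, recompute on backward.
        return checkpoint(block, x, use_reentrant=False)
    # Small input: keep activations, since recomputation would not pay off.
    return block(x)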