H He, S Yu - Proceedings of Machine Learning and Systems, 2023 - proceedings.mlsys.org
Gradient checkpointing is an optimization that reduces the memory footprint by re-computing some operations instead of saving their activations. Previous works on checkpointing have …
J Posner - 2020 IEEE International Conference on Cluster …, 2020 - ieeexplore.ieee.org
Fault tolerance is becoming increasingly important since the probability of permanent hardware failures increases with machine size. A typical resilience approach to fail/stop …
B Nicolae, F Cappello - … of the 22nd international symposium on High …, 2013 - dl.acm.org
With increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge …
As we approach the era of exa-scale computing, fault tolerance is of growing importance. The increasing number of cores as well as the increased complexity of modern …
Deep learning (DL) applications are increasingly being deployed on HPC systems, to leverage the massive parallelism and computing power of those systems for DL model …
Deep learning is slowly, but steadily, hitting a memory bottleneck. While the tensor computation in top-of-the-line GPUs increased by 32x over the last five years, the total …
In the age of big data, deep learning has emerged as a powerful tool to extract insight and exploit its value, both in industry and scientific applications. One common pattern emerging …
Y Verdeja Herms, Y Li - Proceedings of the 2019 on Great Lakes …, 2019 - dl.acm.org
We present a lightweight technique to minimize error recovery costs in approximate computing environments. We take advantage of the key observation that if an application …
J Liao, M Li, H Yang, Q Sun, B Sun… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Larger deep learning models usually lead to higher model quality, however with an ever- increasing GPU memory footprint. Although several tensor checkpointing techniques have …