Task-level checkpointing for nested fork-join programs using work stealing

L Reitz, C Fohry - European Conference on Parallel Processing, 2023 - Springer
Recent Exascale supercomputers consist of millions of processing units, and this number is
still growing. Therefore, hardware failures, such as permanent node failures, become …

Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters

L Reitz, C Fohry - SN Computer Science, 2024 - Springer
Exascale supercomputers consist of millions of processing units, and this number is still
growing. Therefore, hardware failures, such as permanent node failures, become …

Checkpointing and localized recovery for nested fork-join programs

C Fohry - arXiv preprint arXiv:2102.12941, 2021 - arxiv.org
While checkpointing is typically combined with a restart of the whole application, localized
recovery permits all but the affected processes to continue. In task-based cluster …

Task-level resilience: checkpointing vs. supervision

J Posner, L Reitz, C Fohry - International Journal of Networking and …, 2022 - jstage.jst.go.jp
With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …

Checkpointing vs. supervision resilience approaches for dynamic independent tasks

J Posner, L Reitz, C Fohry - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org
With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …

Efficient checkpointing of multi-threaded applications as a tool for debugging, performance tuning, and resiliency

M Grossman, V Sarkar - 2016 IEEE International Parallel and …, 2016 - ieeexplore.ieee.org
Past work on application checkpointing systems has either focused on enabling application
resiliency or as a tool for debugging (as in record-replay literature). Each of these use cases …

Application-level checkpointing for shared memory programs

G Bronevetsky, D Marques, K Pingali, P Szwed… - ACM SIGPLAN …, 2004 - dl.acm.org
Trends in high-performance computing are making it necessary for long-running
applications to tolerate hardware faults. The most commonly used approach is checkpoint …

Linux support for transparent checkpointing of multithreaded programs

CD Carothers, BK Szymanski - To appear in Dr. Dobbs Journal, 2002 - oelzant.priv.at
The most common use of checkpointing is in fault tolerant computing where the goal is to
minimize loss of CPU cycles when a long executing program crashes before completion. By …

An application-level checkpointing based on extended data flow analysis for OpenMP programs

HY Fu, Y Ding, W Song, XJ Yang - Jisuanji Xuebao (Chinese Journal of …, 2010 - cjc.ict.ac.cn
As the wide application of multi-core processor architecture in the domain of high
performance computing, fault tolerance for shared memory parallel programs becomes a hot …

Rebound: scalable checkpointing for coherent shared memory

R Agarwal, P Garg, J Torrellas - Proceedings of the 38th annual …, 2011 - dl.acm.org
As we move to large manycores, the hardware-based global checkpointing schemes that
have been proposed for small shared-memory machines do not scale. Scalability barriers …