Task-level checkpointing for nested fork-join programs using work stealing

L Reitz, C Fohry - European Conference on Parallel Processing, 2023 - Springer
Recent Exascale supercomputers consist of millions of processing units, and this number is
still growing. Therefore, hardware failures, such as permanent node failures, become …

Checkpointing and localized recovery for nested fork-join programs

C Fohry - arXiv preprint arXiv:2102.12941, 2021 - arxiv.org
While checkpointing is typically combined with a restart of the whole application, localized
recovery permits all but the affected processes to continue. In task-based cluster …

Doubt and redundancy kill soft errors—Towards detection and correction of silent data corruption in task-based numerical software

P Samfass, T Weinzierl, A Reinarz… - 2021 IEEE/ACM 11th …, 2021 - ieeexplore.ieee.org
Resilient algorithms in high-performance computing are subject to rigorous non-functional
constraints. Resiliency must not increase the runtime, memory footprint or I/O demands too …

Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters

L Reitz, C Fohry - SN Computer Science, 2024 - Springer
Exascale supercomputers consist of millions of processing units, and this number is still
growing. Therefore, hardware failures, such as permanent node failures, become …