Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Task-level resilience: checkpointing vs. supervision

J Posner, L Reitz, C Fohry - International Journal of Networking and …, 2022 - jstage.jst.go.jp
With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …

Task-level checkpointing for nested fork-join programs using work stealing

L Reitz, C Fohry - European Conference on Parallel Processing, 2023 - Springer
Recent Exascale supercomputers consist of millions of processing units, and this number is
still growing. Therefore, hardware failures, such as permanent node failures, become …

Checkpointing and localized recovery for nested fork-join programs

C Fohry - arXiv preprint arXiv:2102.12941, 2021 - arxiv.org
While checkpointing is typically combined with a restart of the whole application, localized
recovery permits all but the affected processes to continue. In task-based cluster …

Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters

L Reitz, C Fohry - SN Computer Science, 2024 - Springer
Exascale supercomputers consist of millions of processing units, and this number is still
growing. Therefore, hardware failures, such as permanent node failures, become …

Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach

M Whitlock, N Morales, G Bosilca… - 2022 IEEE …, 2022 - ieeexplore.ieee.org
Integrating recent advancements in resilient algorithms and techniques into existing codes is
a singular challenge in fault tolerance, in part due to the underlying complexity of …

Towards distributed software resilience in asynchronous many-task programming models

N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Improving scalability of silent-error resilience for message-passing solvers via local recovery and asynchrony

H Kolla, JR Mayo, K Teranishi… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Benefits of local recovery (restarting only a failed process or task) have been previously
demonstrated in parallel solvers. Local recovery has a reduced impact on application …

Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems

J Posner - 2021 - kobra.uni-kassel.de
High-Performance Computing (HPC) enables solving complex problems from
various scientific fields including key societal problems such as COVID-19. Recently …

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations (Dagstuhl Seminar 20101)

L Giraud, U Rüde, L Stals - 2020 - drops.dagstuhl.de
This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations" held March 1–6, 2020 at Schloss Dagstuhl, that was attended by …