Enabling resilience in asynchronous many-task programming models

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com

This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Cited by 9 · Related articles · All 22 versions

[PDF] jst.go.jp

Task-level resilience: checkpointing vs. supervision

J Posner, L Reitz, C Fohry - International Journal of Networking and …, 2022 - jstage.jst.go.jp

With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …

Cited by 6 · Related articles · All 7 versions

Task-level checkpointing for nested fork-join programs using work stealing

L Reitz, C Fohry - European Conference on Parallel Processing, 2023 - Springer

Recent Exascale supercomputers consist of millions of processing units, and this number is
still growing. Therefore, hardware failures, such as permanent node failures, become …

Cited by 1 · Related articles · All 2 versions

[PDF] arxiv.org

Checkpointing and localized recovery for nested fork-join programs

C Fohry - arXiv preprint arXiv:2102.12941, 2021 - arxiv.org

While checkpointing is typically combined with a restart of the whole application, localized
recovery permits all but the affected processes to continue. In task-based cluster …

Cited by 6 · Related articles · All 2 versions

[HTML] springer.com

Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters

L Reitz, C Fohry - SN Computer Science, 2024 - Springer

Exascale supercomputers consist of millions of processing units, and this number is still
growing. Therefore, hardware failures, such as permanent node failures, become …

Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach

M Whitlock, N Morales, G Bosilca… - 2022 IEEE …, 2022 - ieeexplore.ieee.org

Integrating recent advancements in resilient algorithms and techniques into existing codes is
a singular challenge in fault tolerance, in part due to the underlying complexity of …

Cited by 3 · Related articles · All 8 versions

[PDF] arxiv.org

Towards distributed software resilience in asynchronous many-task programming models

N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org

Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Cited by 4 · Related articles · All 6 versions

[PDF] osti.gov

Improving scalability of silent-error resilience for message-passing solvers via local recovery and asynchrony

H Kolla, JR Mayo, K Teranishi… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org

Benefits of local recovery (restarting only a failed process or task) have been previously
demonstrated in parallel solvers. Local recovery has a reduced impact on application …

Cited by 6 · Related articles · All 4 versions

[PDF] uni-kassel.de

Load Balancing, Fault Tolerance, and Resource Elasticity for Asynchronous Many-Task Systems

J Posner - 2021 - kobra.uni-kassel.de

High-Performance Computing (HPC) enables solving complex problems from
various scientific fields including key societal problems such as COVID-19. Recently …

Cited by 2 · Related articles

[PDF] dagstuhl.de

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations (Dagstuhl Seminar 20101)

L Giraud, U Rüde, L Stals - 2020 - drops.dagstuhl.de

This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by …

Cited by 3 · Related articles · All 3 versions
