Related articles - Academic Resource Search

teaMPI—replication-based resilience without the (performance) pain

P Samfass, T Weinzierl, B Hazelwood… - … Conference, ISC High …, 2020 - Springer

In an era where we can not afford to checkpoint frequently, replication is a generic way
forward to construct numerical simulations that can continue to run even if hardware parts …

Cited by 12 · Related articles · All 11 versions

[PDF] dagstuhl.de

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations (Dagstuhl Seminar 20101)

L Giraud, U Rüde, L Stals - 2020 - drops.dagstuhl.de

This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by …

Cited by 3 · Related articles · All 3 versions

[PDF] arxiv.org

Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com

This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Cited by 9 · Related articles · All 22 versions

[PDF] osti.gov

Performance efficient multiresilience using checkpoint recovery in iterative algorithms

RA Ashraf, C Engelmann - Euro-Par 2018: Parallel Processing Workshops …, 2019 - Springer

In this paper, we address the design challenge of building multiresilient iterative high-
performance computing (HPC) applications. Multiresilience in HPC applications is the ability …

Cited by 3 · Related articles · All 9 versions

Using replication for resilience on exascale systems

H Casanova, F Vivien, D Zaidouni - Fault-Tolerance Techniques for High …, 2015 - Springer

High-performance computing applications must be resilient to faults. The traditional fault
tolerance solution is checkpoint–recovery, by which application state is saved to and …

Cited by 14 · Related articles · All 5 versions

[PDF] ku.edu

Selective Protection for Sparse Iterative Solvers to Reduce the Resilience Overhead

H Sun, A Gainaru, M Shantharam… - 2020 IEEE 32nd …, 2020 - ieeexplore.ieee.org

The increasing scale and complexity of today's high-performance computing (HPC) systems
demand a renewed focus on enhancing the resilience of long-running scientific applications …

Cited by 3 · Related articles · All 9 versions

[PDF] unm.edu

Keeping checkpoint/restart viable for exascale systems

K Ferreira - 2011 - digitalrepository.unm.edu

Next-generation exascale systems, those capable of performing a quintillion operations per
second, are expected to be delivered in the next 8-10 years. These systems, which will be …

Cited by 13 · Related articles · All 4 versions

[PDF] pitt.edu

Power-aware resilience for exascale computing

B Mills - 2014 - search.proquest.com

To enable future scientific breakthroughs and discoveries, the next generation of scientific
applications will require exascale computing performance to support the execution of …

Cited by 3 · Related articles · All 3 versions

[PDF] osti.gov

Improving scalability of silent-error resilience for message-passing solvers via local recovery and asynchrony

H Kolla, JR Mayo, K Teranishi… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org

Benefits of local recovery (restarting only a failed process or task) have been previously
demonstrated in parallel solvers. Local recovery has a reduced impact on application …

Cited by 6 · Related articles · All 4 versions

[PDF] upc.edu

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers

L Jaulmes, M Casas, M Moretó, E Ayguadé… - Proceedings of the …, 2015 - dl.acm.org

This paper presents a method to protect iterative solvers from Detected and Uncorrected
Errors (DUE) relying on error detection techniques already available in commodity …

Cited by 44 · Related articles · All 10 versions
