Algorithm-based fault tolerance for parallel stencil computations

A Cavelan, FM Ciorba - 2019 IEEE international conference on …, 2019 - ieeexplore.ieee.org
The increase in HPC systems size and complexity, together with increasing on-chip
transistor density, power limitations, and number of components, render modern HPC …

Coping with recall and precision of soft error detectors

L Bautista-Gomez, A Benoit, A Cavelan… - Journal of Parallel and …, 2016 - Elsevier
Many methods are available to detect silent errors in high-performance computing (HPC)
applications. Each method comes with a cost, a recall (fraction of all errors that are actually …

Addressing data resiliency for staging based scientific workflows

S Duan, P Subedi, PE Davis, M Parashar - Proceedings of the …, 2019 - dl.acm.org
As applications move towards extreme scales, data-related challenges are becoming
significant concerns, and in-situ workflows based on data staging and in-situ/in-transit data …

Comparative analysis of soft-error detection strategies: A case study with iterative methods

G Kestor, BO Mutlu, J Manzano, O Subasi… - Proceedings of the 15th …, 2018 - dl.acm.org
Undetected soft errors caused by transient bit flips can lead to silent data corruption (SDC),
an undesirable outcome where invalid results pass for valid ones. This has motivated the …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Resilience for stencil computations with latent errors

A Fang, A Cavelan, Y Robert… - 2017 46th International …, 2017 - ieeexplore.ieee.org
Projections and measurements of error rates in near-exascale and exascale systems
suggest a dramatic growth, due to extreme scale (10^9 cores), concurrency, software …

Query-load balancing in structured overlays

A Datta, R Schmidt, K Aberer - Seventh IEEE International …, 2007 - ieeexplore.ieee.org
Query-load (forwarding and answering) balancing in structured overlays is one of the most
critical and least studied problems. It has been assumed that caching heuristics can take …

Identifying the right replication level to detect and correct silent errors at scale

A Benoit, A Cavelan, F Cappello, P Raghavan… - Proceedings of the …, 2017 - dl.acm.org
This paper provides a model and an analytical study of replication as a technique to detect
and correct silent errors. Although other detection techniques exist for HPC applications …

Efficient detection of silent data corruption in HPC applications with synchronization-free message verification

G Zhang, Y Liu, H Yang, D Qian - The Journal of Supercomputing, 2022 - Springer
Nowadays, high-performance computing (HPC) is stepping forward to exascale era.
However, silent data corruption (SDC) behaved as bit-flipping can cause disastrous …

EXAHD: An exa-scalable two-level sparse grid approach for higher-dimensional problems in plasma physics and beyond

M Heene, AP Hinojosa, M Obersteiner… - … Computing in Science …, 2018 - Springer
Within the current reporting period (04/2016–04/2017) of our HLRS project we have
developed a scalable implementation of the fault-tolerant combination technique. Fault …