Resilient N-body tree computations with algorithm-based focused recovery: Model and performance analysis

A Cavelan, A Fang, AA Chien, Y Robert - High Performance Computing …, 2018 - Springer
This paper presents a model and performance study for Algorithm-Based Focused Recovery
(ABFR) applied to N-body computations, subject to latent errors. We make a detailed …

Computing the expected makespan of task graphs in the presence of silent errors

H Casanova, J Herrmann, Y Robert - Parallel Computing, 2018 - Elsevier
Applications structured as Directed Acyclic Graphs (DAGs) of tasks occur in many
domains, including popular scientific workflows. DAG scheduling has thus received an …

Two-level checkpointing and verifications for linear task graphs

A Benoit, A Cavelan, Y Robert… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience
techniques must accommodate both error sources. To cope with the double challenge, a two …

High performance storage system design using emerging storage technologies

J Yang - 2022 - search.proquest.com
In the past few decades, data volume has increased exponentially. Smart devices, social media,
and e-business generate an extremely large amount of data every day. While big data is promising …

A novel approach for handling soft error in conjugate gradients

ME Ozturk, M Renardy, Y Li, G Agrawal… - 2018 IEEE 25th …, 2018 - ieeexplore.ieee.org
Soft errors or bit flips have recently become an important challenge in high performance
computing. In this paper, we focus on soft errors in a particular algorithm: conjugate …

Two-level checkpointing and partial verifications for linear task graphs

A Benoit, A Cavelan, Y Robert… - 6th International Workshop …, 2015 - inria.hal.science
Fail-stop and silent errors are unavoidable on large-scale platforms. Efficient resilience
techniques must accommodate both error sources. A traditional checkpointing and rollback …

Harpocrates: Breaking the Silence of CPU Faults through Hardware-in-the-Loop Program Generation

N Karystinos, O Chatzopoulos, GM Fragkoulis… - users.uoa.gr
Several hyperscalers have recently disclosed the occurrence of Silent Data Corruptions
(SDCs) in their systems fleets, sparking concerns about the severity of known and the …

Scalable algorithmic detection of silent data corruption for high-dimensional PDEs

AP Hinojosa, HJ Bungartz, D Pflüger - Sparse Grids and Applications …, 2018 - Springer
In this paper we show how to benefit from the numerical properties of a well-established
extrapolation method—the combination technique—to make it tolerant to silent data …

Reliability for exascale computing: system modelling and error mitigation for task-parallel HPC applications

O Subasi - 2016 - upcommons.upc.edu
As high performance computing (HPC) systems continue to grow, their fault rate increases.
Applications running on these systems have to deal with rates on the order of hours or days …

Bi-source verification against silent data corruption in high performance computing

EA Krluku, M Gusev, V Zdraveski - … of the 9th Balkan Conference on …, 2019 - dl.acm.org
This paper proposes a continuous health-check approach for detecting Silent Data
Corruption (SDC) in High Performance Computing (HPC) environments. The goal is to …