A methodology for soft errors detection and automatic recovery

D Montezanti, A De Giusti, M Naiouf… - … Conference on High …, 2017 - ieeexplore.ieee.org
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals
and silent faults are expected in the future. It is projected that, in exascale systems, errors …

Designing and modelling selective replication for fault-tolerant HPC applications

O Subasi, G Yalcin, F Zyulkyarov… - 2017 17th IEEE/ACM …, 2017 - ieeexplore.ieee.org
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for
High Performance Computing (HPC) applications. There are studies that address fail-stop …

Characterizing a detection strategy for transient faults in HPC

DM Montezanti, D Rexachs del Rosario, E Rucci… - 2016 - sedici.unlp.edu.ar
Handling faults is a growing concern in HPC; greater varieties, higher error rates, larger
detection intervals and silent faults are expected in the future. It is projected that, in exascale …

Assessing HPC failure detectors for MPI jobs

K Kharbas, D Kim, T Hoefler… - 2012 20th Euromicro …, 2012 - ieeexplore.ieee.org
Reliability is one of the challenges faced by exascale computing. Components are poised to
fail during large-scale executions given current mean time between failure (MTBF) …

Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing

D Montezanti, E Rucci, A De Giusti, M Naiouf… - Future Generation …, 2020 - Elsevier
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that
silent undetected errors will occur several times a day, increasing the occurrence of …

Combining partial redundancy and checkpointing for HPC

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org
Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

Opportunistic application-level fault detection through adaptive redundant multithreading

S Hukerikar, PC Diniz, RF Lucas… - … Conference on High …, 2014 - ieeexplore.ieee.org
As the scale and complexity of future High Performance Computing systems continues to
grow, the rising frequency of faults and errors and their impact on HPC applications will …

A case study of designing efficient algorithm-based fault tolerant application for exascale parallelism

E Yao, R Wang, M Chen, G Tan… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org
Fault tolerance overhead of high performance computing (HPC) applications is becoming
critical to the efficient utilization of HPC systems at large scale. Today's HPC applications …

Evaluating the error resilience of parallel programs

B Fang, K Pattabiraman, M Ripeanu… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
As a consequence of increasing hardware fault rates, HPC systems face significant
challenges in terms of reliability. Evaluating the error resilience of HPC applications is an …

Understanding the propagation of transient errors in HPC applications

RA Ashraf, R Gioiosa, G Kestor, RF DeMara… - Proceedings of the …, 2015 - dl.acm.org
Resiliency of exascale systems has quickly become an important concern for the scientific
community. Despite its importance, still much remains to be determined regarding how faults …