Designing and modelling selective replication for fault-tolerant HPC applications

O Subasi, G Yalcin, F Zyulkyarov… - 2017 17th IEEE/ACM …, 2017 - ieeexplore.ieee.org
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for
High Performance Computing (HPC) applications. There are studies that address fail-stop …

Automatic risk-based selective redundancy for fault-tolerant task-parallel HPC applications

O Subasi, O Unsal, S Krishnamoorthy - Proceedings of the Third …, 2017 - dl.acm.org
Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in high-
performance computing (HPC) systems. In this study, we present an automatic, efficient and …

A methodology for soft errors detection and automatic recovery

D Montezanti, A De Giusti, M Naiouf… - … Conference on High …, 2017 - ieeexplore.ieee.org
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals
and silent faults are expected in the future. It is projected that, in exascale systems, errors …

Programmer-directed partial redundancy for resilient HPC

O Subasi, J Arias, O Unsal, J Labarta… - Proceedings of the 12th …, 2015 - dl.acm.org
In this work we propose partial task replication and checkpointing for task-parallel HPC
applications to mitigate silent data corruption (SDC) errors. As the complete replication of all …

Optimal resilience patterns to cope with fail-stop and silent errors

A Benoit, A Cavelan, Y Robert… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop
errors. Many others deal with silent errors (or silent data corruptions). But very few papers …

Assuming failure independence: are we right to be wrong?

G Aupy, Y Robert, F Vivien - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
This paper revisits the failure temporal independence hypothesis which is omnipresent in
the analysis of resilience methods for HPC. We explain why a previous approach is …

LetGo: A lightweight continuous framework for HPC applications under failures

B Fang, Q Guan, N Debardeleben… - Proceedings of the 26th …, 2017 - dl.acm.org
Requirements for reliability, low power consumption, and performance place complex and
conflicting demands on the design of high-performance computing (HPC) systems. Fault …

Combining partial redundancy and checkpointing for HPC

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org
Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

RaaS: Resilience as a service

J Villamayor, D Rexachs, E Luque… - 2018 18th IEEE/ACM …, 2018 - ieeexplore.ieee.org
Cloud computing is continuously increasing its popularity as key features such as scalability,
pay-per-use and availability continue to evolve. It is also becoming a competitive platform for …

An efficient silent data corruption detection method with error-feedback control and even sampling for HPC applications

S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attention because it
is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …