Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in high-performance computing (HPC) systems. In this study, we present an automatic, efficient and …
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors …
In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. As the complete replication of all …
This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop errors. Many others deal with silent errors (or silent data corruptions). But very few papers …
This paper revisits the failure temporal independence hypothesis which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is …
Requirements for reliability, low power consumption, and performance place complex and conflicting demands on the design of high-performance computing (HPC) systems. Fault …
J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org
Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven …
Cloud computing is continuously increasing its popularity as key features such as scalability, pay-per-use and availability continue to evolve. It is also becoming a competitive platform for …
S Di, E Berrocal, F Cappello - 2015 15th IEEE/ACM …, 2015 - ieeexplore.ieee.org
The silent data corruption (SDC) problem is attracting more and more attention because it is expected to have a great impact on exascale HPC applications. SDC faults are hazardous …