Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org
… and power challenges is expected to increase error rates. Thus, reliability is a serious
concern in the exascale era. Silent data corruptions (SDCs) or silent errors are one of the most …

Reliability for exascale computing: system modelling and error mitigation for task-parallel HPC applications

O Subasi - 2016 - upcommons.upc.edu
… As seen, increasing failure rate - expected for the Exascale era - is the dominating factor for
… -of-the-art on silent data corruptions or silent errors which are the type of errors that we aim to …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
… and describe the problem of resilience in the exascale era. In Section 2, we present a …
The question is, how do we mitigate them, especially for silent errors that may lead to SDC? …

diffReplication--An Energy-Aware Fault Tolerance Model for Silent Error Detection and Mitigation in Heterogeneous Extreme-scale Computing Environment

L Li, T Znati, R Melhem - JUCS: Journal of Universal …, 2023 - search.ebscohost.com
… As hardware and software advances are made to usher in the next scientific era of … errors,
which will be referred to as silent errors (SEs) in this paper, are neither bugs nor software errors

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
… technology scaling in future Exascale systems. Technology scaling … paths to meet the reliability
levels of Exascale systems. … vector regression to detect silent errors in the exascale era. In …

Unprotected computing: A large-scale study of dram raw error rate on a supercomputer

L Bautista-Gomez, F Zyulkyarov… - SC'16: Proceedings …, 2016 - ieeexplore.ieee.org
errors escaping hardware checks, which lead to silent data corruption. This work attempts to
fill that gap by analyzing memory errors … proxy for the future exascale era DDRs. There are …

The path to exascale: Code optimizations and hardening solutions reliability

DAG de Oliveira, L Pilla, C Lunardi, L Carro… - Proceedings of the 5th …, 2015 - dl.acm.org
… that produce crashes as well as Silent Data Corruption (SDC) … era. The approach we adopt
to perform these evaluations is … For an exascale system a much higher portion of errors may …

Marriage between coordinated and uncoordinated checkpointing for the exascale era

O Subasi, F Zyulkyarov, O Unsal… - 2015 IEEE 17th …, 2015 - ieeexplore.ieee.org
… projected to be prohibitively expensive in the Exascale era. These techniques are most often
… fail-stop errors. The undetected errors, called silent errors, are out of the scope of this study. …

Neural network based silent error detector

C Wang, N Dryden, F Cappello… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
… Abstract—As we move toward exascale platforms, silent data corruptions (… of silent errors
in different HPC applications. We show that for certain types of applications, large silent errors

Towards sustainable exascale computing

R Gioiosa - 2010 18th IEEE/IFIP International Conference on …, 2010 - ieeexplore.ieee.org
… and opportunities on the way to the exascale era. This paper shows how a closer … errors
are, however, silent, i.e., cannot be detected. Exascale systems need to consider all errors