Many methods are available to detect silent errors in high-performance computing (HPC) applications. Each method comes with a cost, a recall (fraction of all errors that are actually …
As applications move towards extreme scales, data-related challenges are becoming significant concerns, and in-situ workflows based on data staging and in-situ/in-transit data …
Undetected soft errors caused by transient bit flips can lead to silent data corruption (SDC), an undesirable outcome where invalid results pass for valid ones. This has motivated the …
J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
As high-performance computing systems have scaled up in recent years, their reliability has declined steadily. Therefore, system resilience has been regarded as one …
Projections and measurements of error rates in near-exascale and exascale systems suggest a dramatic growth, due to extreme scale (10^9 cores), concurrency, software …
A Datta, R Schmidt, K Aberer - Seventh IEEE International …, 2007 - ieeexplore.ieee.org
Query-load (forwarding and answering) balancing in structured overlays is one of the most critical and least studied problems. It has been assumed that caching heuristics can take …
This paper provides a model and an analytical study of replication as a technique to detect and correct silent errors. Although other detection techniques exist for HPC applications …
G Zhang, Y Liu, H Yang, D Qian - The Journal of Supercomputing, 2022 - Springer
High-performance computing (HPC) is now entering the exascale era. However, silent data corruption (SDC), which manifests as bit flips, can cause disastrous …
M Heene, AP Hinojosa, M Obersteiner… - … Computing in Science …, 2018 - Springer
Within the current reporting period (04/2016–04/2017) of our HLRS project we have developed a scalable implementation of the fault-tolerant combination technique. Fault …