Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
… Finally, we identify the promising paths to meet the reliability levels of … Spatial support
vector regression to detect silent errors in the exascale era. In Proceedings of the 16th IEEE/ACM …

GPGPU Reliability Analysis: From Applications to Large Scale Systems

B Nie - 2019 - scholarworks.wm.edu
… soft errors are linked to several temporal and spatial features, … As we progressing towards
exascale, applications are going to … resilience in GPUs and demonstrate that the ratio of silent

The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
… that exascale systems will experience faults and errors more fre… However, silent data corruption
(SDC) might require more … a global memory address space partitioned across the nodes. …

The ESCAPE project: energy-efficient scalable algorithms for weather prediction at exascale

A Müller, W Deconinck, C Kühnlein… - Geoscientific Model …, 2019 - gmd.copernicus.org
… ESCAPE strategy: (i) identify domain-specific key algorithmic … the optimisations, whereas
the error measures verify to what … most efficiently in grid point space, while horizontal gradients…

Sentiment analysis based error detection for large-scale systems

KA Alharthi, A Jhumka, S Di… - 2021 51st Annual …, 2021 - ieeexplore.ieee.org
… are designed/utilized towards exascale computing, inevitably … on one-day period (27-March-2017)
due to space limit and … using partial labels based on PU learning and Support Vector

[BOOK][B] Toward Resilience and Data Reduction in Exascale Scientific Computing

X Liang - 2019 - search.proquest.com
… be more susceptible to soft errors, e.g., silent data corruptions, … is able to detect errors online
soon after the error occurs so … for timely detection, faster recovery and less space overhead …

Using machine learning techniques to evaluate multicore soft error reliability

FR da Rosa, R Garibotti, L Ost… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
… different ML algorithms (e.g., support vector machines, k-Nearest … To inject faults and check
for errors, this work introduces … investigations during early design space explorations process, …

Response of HPC hardware to neutron radiation at the dawn of exascale

A Bustos, AJ Rubio-Montero, R Méndez… - The Journal of …, 2023 - Springer
silent data corruption detectors by leveraging support vector … The right replication level to
detect and correct silent errors at … their spatial locality and providing the mean relative error (…

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations (Dagstuhl Seminar 20101)

L Giraud, U Rüde, L Stals - 2020 - drops.dagstuhl.de
… scientists with expertise in exascale computing to discuss novel … for detection, containment
and mitigation of silent data … -resolution in space or time and the error estimators themselves …

Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice

D Jauk, D Yang, M Schulz - … of the International Conference for High …, 2019 - dl.acm.org
… As we near exascale, resilience remains a major technical … spatial and temporal correlation
of memory errors to identify … RIPPER and the support vector machine perform reasonably well …