The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

Methods for improving the reliability of intelligent semiconductor

S Park, S Jeon, B Kim, J Lee - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
Intelligent semiconductors adopted in safety-critical systems such as autonomous vehicles
and medical devices require high reliability. It is necessary to analyze soft-errors that may …

User-level failure detection and auto-recovery of parallel programs in HPC systems

G Zhang, Y Liu, H Yang, J Xu, D Qian - Frontiers of Computer Science, 2021 - Springer
As the mean-time-between-failures (MTBF) continues to decline with the increasing number
of components on large-scale high performance computing (HPC) systems, program failures …

A novel approach for handling soft error in conjugate gradients

ME Ozturk, M Renardy, Y Li, G Agrawal… - 2018 IEEE 25th …, 2018 - ieeexplore.ieee.org
Soft errors or bit flips have recently become an important challenge in high performance
computing. In this paper, we focus on soft errors in a particular algorithm: conjugate …

Handling soft errors in Krylov subspace methods by exploiting their numerical properties

ME Ozturk, G Agrawal, Y Li, CS Chou - 2020 - academia.edu
Krylov space methods are a popular means for solving sparse systems. In this paper, we
consider three such methods: GMRES, Conjugate Gradient (CG) and Conjugate Residual …

An extensive study on iterative solver resilience: characterization, detection and prediction

BO Mutlu - 2019 - upcommons.upc.edu
Soft errors caused by transient bit flips have the potential to significantly impact an
application's behavior. This has motivated the design of an array of techniques to detect …

Ensemble learning based Architecture Vulnerability Factor calculation using partial feature set in processors

J Wang, J Jiao, Y Fu - Journal of Physics: Conference Series, 2019 - iopscience.iop.org
With the scaling technology, soft error induced bit upsets are increasingly threatening the
processor reliability. Processor designers require effective tools or methodologies to …