Selective fault tolerance for register files of graphics processing units

M Goncalves, F Fernandes, I Lamb… - … on Nuclear Science, 2019 - ieeexplore.ieee.org
… correct result, and HPC applications that must guarantee time … a good but inefficient
fault-tolerant solution. We evaluate the … up until where we identify the design’s most sensitive and …

Hermes: A fast, fault-tolerant and linearizable replication protocol

A Katsarakis, V Gavrielatos, MRS Katebzadeh… - Proceedings of the …, 2020 - dl.acm.org
… This work addresses the challenge of designing a reliable replication protocol that provides
both … Failure model We consider a partially synchronous system [34] where processes are …

A new fault-tolerant algorithm based on replication and preemptive migration in cloud computing

A Semmoud, M Hakem, B Benmammar… - … of Cloud Applications …, 2022 - igi-global.com
… The fault tolerant algorithm The proposed model handles faults that may occur in all available virtual…
and parallel and distributed algorithms design for sensor networks, clusters and grids. …

End-to-end resilience for HPC applications

A Rezaei, H Khetawat, O Patil, F Mueller… - … Performance Computing, 2019 - Springer
… We design and implement a resilience pragma to support … Our fault model considers soft
errors/SDCs that materialize in … it the need for task-based replication. Unlike our work, they do …

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
… as design considerations in current and future HPC systems. … of the art on relevant fault
models for HPC systems (Section 2… to be “scalable, efficient, fault tolerant and easy-to-manage” […

teaMPI—replication-based resilience without the (performance) pain

P Samfass, T Weinzierl, B Hazelwood… - … Performance Computing …, 2020 - Springer
Designing and modelling selective replication for fault-tolerant HPC applications. In: 17th
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 452…

Towards resilient graphics processing units: designing fault tolerance techniques for radiation-induced faults

MM Gonçalves - 2024 - lume.ufrgs.br
… in the development of reliable, fault-tolerant GPU architectures. Our … HPC applications.
The frequency of these faults in clustered … We propose and evaluate selective fault tolerance …

A survey on multithreading alternatives for soft error fault tolerance

I Oz, S Arslan - ACM Computing Surveys (CSUR), 2019 - dl.acm.org
… and their extensions and discuss the design choices employed by the … sphere of replication,
environment, hardware model, or … [71] observe that the proposed fault-tolerant CMPs suffer …

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

T Benacchio, L Bonaventura… - … High Performance …, 2021 - journals.sagepub.com
… The most basic form of resilience is replication, whereby … This section contains illustrative
applications of the fault-tolerant design patterns across the layers of HPC systems and …

Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators-Trends in Quantum Computing, Heterogeneous Systems and …

S Venkatesha, R Parthasarathi - ACM Computing Surveys, 2024 - dl.acm.org
Fault tolerant designs are provided to protect the remaining portion of the die covering CPU
… In strict model of replication, Reunion incurs an average performance penalty of 5% and 2% …