Detection and correction of silent data corruption for large-scale high-performance computing

D Fiala, F Mueller, C Engelmann… - SC'12: Proceedings …, 2012 - ieeexplore.ieee.org
Faults have become the norm rather than the exception for high-end computing clusters.
Exacerbating this situation, some of these faults remain undetected, manifesting themselves …

Combining partial redundancy and checkpointing for HPC

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org
Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

[PDF] Redundant execution of HPC applications with MR-MPI

C Engelmann, S Böhm - Proceedings of the 10th IASTED …, 2011 - christian-engelmann.info
This paper presents a modular-redundant Message Passing Interface (MPI) solution, MR-MPI,
for transparently executing high-performance computing (HPC) applications in a …

Reachability testing: An approach to testing concurrent software

GH Hwang, KC Tai, TL Huang - International Journal of Software …, 1995 - World Scientific
Concurrent programs are more difficult to test than sequential programs because of
non-deterministic behavior. An execution of a concurrent program non-deterministically …

[PDF] Process Migration for Resilient Applications

K McGill, S Taylor - Dartmouth College, 2011 - academia.edu
The notion of resiliency is concerned with constructing mission-critical distributed
applications that are able to operate through a wide variety of failures, errors, and malicious …

Proactive process-level live migration and back migration in HPC environments

C Wang, F Mueller, C Engelmann, SL Scott - Journal of Parallel and …, 2012 - Elsevier
As the number of nodes in high-performance computing environments keeps increasing,
faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

[BOOK] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com
Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

Replication is more efficient than you think

A Benoit, T Herault, VL Fèvre, Y Robert - Proceedings of the International …, 2019 - dl.acm.org
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication
enables the application to survive many fail-stop errors, thereby allowing for longer …

Fault tolerance on large scale systems using adaptive process replication

C George, S Vadhiyar - IEEE Transactions on Computers, 2014 - ieeexplore.ieee.org
Exascale systems of the future are predicted to have mean time between failures (MTBF) of
less than one hour. At such low MTBFs, employing periodic checkpointing alone will result in …