A survey of rollback-recovery protocols in message-passing systems

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

Adaptive incremental checkpointing for massively parallel systems

S Agarwal, R Garg, MS Gupta, JE Moreira - Proceedings of the 18th …, 2004 - dl.acm.org
Given the scale of massively parallel systems, occurrence of faults is no longer an exception
but a regular event. Periodic checkpointing is becoming increasingly important in these …

Method and system for providing high availability to distributed computer applications

A Havemose, CP Ngan - US Patent 7,681,075, 2010 - Google Patents

Memory exclusion: Optimizing the performance of checkpointing systems

JS Plank, Y Chen, K Li, M Beck… - Software: practice and …, 1999 - Wiley Online Library
Checkpointing systems are a convenient way for users to make their programs fault‐tolerant
by intermittently saving program state to disk and restoring that state following a failure. The …

libhashckpt: hash-based incremental checkpointing using gpu's

KB Ferreira, R Riesen, R Brightwell, P Bridges… - European MPI Users' …, 2011 - Springer
Concern is beginning to grow in the high-performance computing (HPC) community
regarding the reliability guarantees of future large-scale systems. Disk-based coordinated …

Current practice and a direction forward in checkpoint/restart implementations for fault tolerance

JC Sancho, F Petrini, K Davis… - 19th IEEE …, 2005 - ieeexplore.ieee.org
Checkpoint/restart is a general idea for which particular implementations enable various
functionalities in computer systems, including process migration, gang scheduling …

Lightweight memory checkpointing

D Vogt, C Giuffrida, H Bos… - 2015 45th Annual IEEE …, 2015 - ieeexplore.ieee.org
Memory checkpointing is a pivotal technique in systems reliability, with applications ranging
from crash recovery to replay debugging. Unfortunately, many traditional memory check …

[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com
Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

Improving scalability of application-level checkpoint-recovery by reducing checkpoint sizes

I Cores, G Rodríguez, MJ Martín, P González… - New Generation …, 2013 - Springer
The execution times of large-scale parallel applications on nowadays multi/many-core
systems are usually longer than the mean time between failures. Therefore, parallel …

Speculative memory checkpointing

D Vogt, A Miraglia, G Portokalidis, H Bos… - Proceedings of the 16th …, 2015 - dl.acm.org
High-frequency memory checkpointing is an important technique in several application
domains, such as automatic error recovery (where frequent checkpoints allow the system to …