A survey of rollback-recovery protocols in message-passing systems

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

A survey of checkpointing algorithms for parallel and distributed computers

S Kalaiselvi, V Rajaraman - Sadhana, 2000 - Springer
Checkpoint is defined as a designated place in a program at which normal processing is
interrupted specifically to preserve the status information necessary to allow resumption of …

Message logging: Pessimistic, optimistic, causal, and optimal

L Alvisi, K Marzullo - IEEE Transactions on Software …, 1998 - ieeexplore.ieee.org
Message-logging protocols are an integral part of a popular technique for implementing
processes that can recover from crash failures. All message-logging protocols require that …

Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems

J Chung, I Lee, M Sullivan, JH Ryoo… - Scientific …, 2013 - content.iospress.com
This paper describes and evaluates a scalable and efficient resilience scheme based on the
concept of containment domains. Containment domains are a programming construct that …

Clonos: Consistent causal recovery for highly-available streaming dataflows

PF Silvestre, M Fragkoulis, D Spinellis… - Proceedings of the …, 2021 - dl.acm.org
Stream processing lies in the backbone of modern businesses, being employed for
mission-critical applications such as real-time fraud detection, car-trip fare calculations, traffic …

A case for two-level distributed recovery schemes

NH Vaidya - Proceedings of the 1995 ACM SIGMETRICS joint …, 1995 - dl.acm.org
Most distributed and multiprocessor recovery schemes proposed in the literature are
designed to tolerate an arbitrary number of failures. In this paper, we demonstrate that it is often …

Message logging: Pessimistic, optimistic, and causal

L Alvisi, K Marzullo - Proceedings of 15th International …, 1995 - ieeexplore.ieee.org
Message logging protocols are an integral part of a technique for implementing processes
that can recover from crash failures. All message logging protocols require that, when …

Lazy checkpoint coordination for bounding rollback propagation

YM Wang, WK Fuchs - … of 1993 IEEE 12th Symposium on …, 1993 - ieeexplore.ieee.org
The technique of lazy checkpoint coordination, which preserves process autonomy while
employing communication-induced checkpoint coordination for bounding rollback …

Portable checkpointing for heterogeneous architectures

B Ramkumar, V Strumpen - Proceedings of IEEE 27th …, 1997 - ieeexplore.ieee.org
Current approaches for checkpointing assume system homogeneity, where checkpointing
and recovery are both performed on the same processor architecture and operating system …

The cost of recovery in message logging protocols

S Rao, L Alvisi, HM Vin - IEEE Transactions on Knowledge and …, 2000 - ieeexplore.ieee.org
Past research in message logging has focused on studying the relative overhead imposed
by pessimistic, optimistic and causal protocols during failure-free executions. In this paper …