A survey of rollback-recovery protocols in message-passing systems

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

Replication for web hosting systems

S Sivasubramanian, M Szymaniak, G Pierre… - ACM Computing …, 2004 - dl.acm.org
Replication is a well-known technique to improve the accessibility of Web sites. It generally
offers reduced client latencies and increases a site's availability. However, applying …

Tachyon: Reliable, memory speed storage for cluster computing frameworks

H Li, A Ghodsi, M Zaharia, S Shenker… - Proceedings of the ACM …, 2014 - dl.acm.org
Tachyon is a distributed file system enabling reliable data sharing at memory speed across
cluster computing frameworks. While caching today improves read workloads, writes are …

[BOOK][B] Distributed systems

M Van Steen, AS Tanenbaum - 2017 - dgma.donetsk.ua
This is the third edition of “Distributed Systems.” In many ways, it is a huge difference
compared to the previous editions, the most important one perhaps being that we have fully …

The design and implementation of checkpoint/restart process fault tolerance for Open MPI

J Hursey, JM Squyres, TI Mattox… - 2007 IEEE …, 2007 - ieeexplore.ieee.org
To be able to fully exploit ever larger computing platforms, modern HPC applications and
system software must be able to tolerate inevitable faults. Historically, MPI implementations …

Database replication using generalized snapshot isolation

S Elnikety, F Pedone… - 24th IEEE Symposium on …, 2005 - ieeexplore.ieee.org
Generalized snapshot isolation extends snapshot isolation as used in Oracle and other
databases in a manner suitable for replicated databases. While (conventional) snapshot …

Uncoordinated checkpointing without domino effect for send-deterministic MPI applications

A Guermouche, T Ropars, E Brunet… - … Parallel & Distributed …, 2011 - ieeexplore.ieee.org
As reported by many recent studies, the mean time between failures of future post-petascale
supercomputers is likely to reduce, compared to the current situation. The most popular fault …

[PS][PS] Adaptive and reliable parallel computing on networks of workstations

RD Blumofe, PA Lisiecki - … Annual Technical Conference on UNIX and …, 1997 - usenix.org
In this paper, we present the design of Cilk-NOW, a runtime system that adaptively and
reliably executes functional Cilk programs in parallel on a network of UNIX workstations. Cilk …

Clonos: Consistent causal recovery for highly-available streaming dataflows

PF Silvestre, M Fragkoulis, D Spinellis… - Proceedings of the …, 2021 - dl.acm.org
Stream processing lies in the backbone of modern businesses, being employed for mission
critical applications such as real-time fraud detection, car-trip fare calculations, traffic …

Lineage stash: fault tolerance off the critical path

S Wang, J Liagouris, R Nishihara, P Moritz… - Proceedings of the 27th …, 2019 - dl.acm.org
As cluster computing frameworks such as Spark, Dryad, Flink, and Ray are being deployed
in mission critical applications and on larger and larger clusters, their ability to tolerate …