A survey of rollback-recovery protocols in message-passing systems

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

DMTCP: Transparent checkpointing for cluster computations and the desktop

J Ansel, K Arya, G Cooperman - 2009 IEEE international …, 2009 - ieeexplore.ieee.org
DMTCP (distributed multithreaded checkpointing) is a transparent user-level checkpointing
package for distributed applications. Checkpointing and restart is demonstrated for a wide …

PLFS: A checkpoint filesystem for parallel applications

J Bent, G Gibson, G Grider, B McClelland… - Proceedings of the …, 2009 - dl.acm.org
Parallel applications running across thousands of processors must protect themselves from
inevitable system failures. Many applications insulate themselves from failures by …

Diskless checkpointing

JS Plank, K Li, MA Puening - IEEE Transactions on parallel and …, 1998 - ieeexplore.ieee.org
Diskless Checkpointing is a technique for checkpointing the state of a long-running
computation on a distributed system without relying on stable storage. As such, it eliminates …

Fault tolerance in grid computing: state of the art and open issues

R Garg, AK Singh - International Journal of Computer Science and …, 2011 - academia.edu
Fault tolerance is an important property for large scale computational grid systems, where
geographically distributed nodes co-operate to execute a task. In order to achieve high level …

Adaptive incremental checkpointing for massively parallel systems

S Agarwal, R Garg, MS Gupta, JE Moreira - Proceedings of the 18th …, 2004 - dl.acm.org
Given the scale of massively parallel systems, occurrence of faults is no longer an exception
but a regular event. Periodic checkpointing is becoming increasingly important in these …

Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery

EN Elnozahy, JS Plank - IEEE Transactions on Dependable …, 2004 - ieeexplore.ieee.org
Over the past two decades, rollback-recovery via checkpoint-restart has been used with
reasonable success for long-running applications, such as scientific workloads that take …

Exploration of lossy compression for application-level checkpoint/restart

N Sasaki, K Sato, T Endo… - 2015 IEEE international …, 2015 - ieeexplore.ieee.org
The scale of high performance computing (HPC) systems is exponentially growing,
potentially causing prohibitive shrinkage of mean time between failures (MTBF) while the …

MCREngine: A scalable checkpointing system using data-aware aggregation and compression

TZ Islam, K Mohror, S Bagchi, A Moody… - SC'12: Proceedings …, 2012 - ieeexplore.ieee.org
High performance computing (HPC) systems use checkpoint-restart to tolerate failures.
Typically, applications store their states in checkpoints on a parallel file system (PFS). As …

Optimistic deltas for WWW latency reduction

G Banga, F Douglis, M Rabinovich - Proc. 1997 USENIX Technical …, 1997 - usenix.org
When a machine is connected to the Internet via a slow network, such as a 28.8 Kbps
modem, the cumulative latency to communicate over the Internet to World Wide Web servers …