A survey of rollback-recovery protocols in message-passing systems

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org
This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery

EN Elnozahy, JS Plank - IEEE Transactions on Dependable …, 2004 - ieeexplore.ieee.org
Over the past two decades, rollback-recovery via checkpoint-restart has been used with
reasonable success for long-running applications, such as scientific workloads that take …

An analysis of communication induced checkpointing

L Alvisi, E Elnozahy, S Rao, SA Husain… - Digest of Papers …, 1999 - ieeexplore.ieee.org
Communication induced checkpointing (CIC) allows processes in a distributed computation
to take independent checkpoints and to avoid the domino effect. This paper presents an …

APOGEE: Adaptive prefetching on GPUs for energy efficiency

A Sethia, G Dasika, M Samadi… - Proceedings of the 22nd …, 2013 - ieeexplore.ieee.org
Modern graphics processing units (GPUs) combine large amounts of parallel hardware with
fast context switching among thousands of active threads to achieve high performance …

Communication-induced determination of consistent snapshots

J Helary, A Mostefaoui, M Raynal - IEEE Transactions on …, 1999 - ieeexplore.ieee.org
A classical way to determine consistent snapshots consists in using Chandy-Lamport's
algorithm. This algorithm relies on specific control messages that allow processes to …

[BOOK][B] Concurrent and distributed computing in Java

VK Garg - 2005 - books.google.com
Concurrent and Distributed Computing in Java addresses fundamental concepts in
concurrent computing with Java examples. The book consists of two parts. The first part …

Fault tolerance for remote memory access programming models

M Besta, T Hoefler - Proceedings of the 23rd international symposium on …, 2014 - dl.acm.org
Remote Memory Access (RMA) is an emerging mechanism for programming high-
performance computers and datacenters. However, little work exists on resilience schemes …

Shadow replication: An energy-aware, fault-tolerant computational model for green cloud computing

X Cui, B Mills, T Znati, R Melhem - Energies, 2014 - mdpi.com
As the demand for cloud computing continues to increase, cloud service providers face the
daunting challenge to meet the negotiated SLA agreement, in terms of reliability and timely …

[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com
Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

Staggered consistent checkpointing

NH Vaidya - IEEE Transactions on Parallel and distributed …, 1999 - ieeexplore.ieee.org
A consistent checkpointing algorithm saves a consistent view of a distributed application's
state on stable storage. The traditional consistent checkpointing algorithms require different …