Preventing useless checkpoints in distributed computations

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org

This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

Cited by 2591 · Related articles · All 52 versions

Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery

EN Elnozahy, JS Plank - IEEE Transactions on Dependable …, 2004 - ieeexplore.ieee.org

Over the past two decades, rollback-recovery via checkpoint-restart has been used with
reasonable success for long-running applications, such as scientific workloads that take …

Cited by 296 · Related articles · All 8 versions

[PDF] psu.edu

An analysis of communication induced checkpointing

L Alvisi, E Elnozahy, S Rao, SA Husain… - Digest of Papers …, 1999 - ieeexplore.ieee.org

Communication induced checkpointing (CIC) allows processes in a distributed computation
to take independent checkpoints and to avoid the domino effect. This paper presents an …

Cited by 215 · Related articles · All 14 versions

[PDF] uth.gr

APOGEE: Adaptive prefetching on GPUs for energy efficiency

A Sethia, G Dasika, M Samadi… - Proceedings of the 22nd …, 2013 - ieeexplore.ieee.org

Modern graphics processing units (GPUs) combine large amounts of parallel hardware with
fast context switching among thousands of active threads to achieve high performance …

Cited by 98 · Related articles · All 11 versions

[PDF] researchgate.net

Communication-induced determination of consistent snapshots

J Helary, A Mostefaoui, M Raynal - IEEE Transactions on …, 1999 - ieeexplore.ieee.org

A classical way to determine consistent snapshots consists in using Chandy-Lamport's
algorithm. This algorithm relies on specific control messages that allow processes to …

Cited by 95 · Related articles · All 19 versions

[PDF] academia.edu

[BOOK][B] Concurrent and distributed computing in Java

VK Garg - 2005 - books.google.com

Concurrent and Distributed Computing in Java addresses fundamental concepts in
concurrent computing with Java examples. The book consists of two parts. The first part …

Cited by 91 · Related articles · All 6 versions

[PDF] arxiv.org

Fault tolerance for remote memory access programming models

M Besta, T Hoefler - Proceedings of the 23rd international symposium on …, 2014 - dl.acm.org

Remote Memory Access (RMA) is an emerging mechanism for programming high-
performance computers and datacenters. However, little work exists on resilience schemes …

Cited by 46 · Related articles · All 28 versions

[PDF] mdpi.com

Shadow replication: An energy-aware, fault-tolerant computational model for green cloud computing

X Cui, B Mills, T Znati, R Melhem - Energies, 2014 - mdpi.com

As the demand for cloud computing continues to increase, cloud service providers face the
daunting challenge to meet the negotiated SLA agreement, in terms of reliability and timely …

Cited by 37 · Related articles · All 20 versions

[PDF] proquest.com

[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com

Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

Cited by 45 · Related articles · All 6 versions

[PDF] psu.edu

Staggered consistent checkpointing

NH Vaidya - IEEE Transactions on Parallel and distributed …, 1999 - ieeexplore.ieee.org

A consistent checkpointing algorithm saves a consistent view of a distributed application's
state on stable storage. The traditional consistent checkpointing algorithms require different …

Cited by 67 · Related articles · All 10 versions
