Probabilistic checkpointing

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org

This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

Cited by 2591 · Related articles · All 52 versions

[PDF] psu.edu

Adaptive incremental checkpointing for massively parallel systems

S Agarwal, R Garg, MS Gupta, JE Moreira - Proceedings of the 18th …, 2004 - dl.acm.org

Given the scale of massively parallel systems, occurrence of faults is no longer an exception
but a regular event. Periodic checkpointing is becoming increasingly important in these …

Cited by 258 · Related articles · All 7 versions

[PDF] googleapis.com

Method and system for providing high availability to distributed computer applications

A Havemose, CP Ngan - US Patent 7,681,075, 2010 - Google Patents

US7681075B2 - Method and system for providing high availability to distributed computer
applications - Google Patents US7681075B2 - Method and system for providing high …

Cited by 142 · Related articles · All 4 versions

Memory exclusion: Optimizing the performance of checkpointing systems

JS Plank, Y Chen, K Li, M Beck… - Software: practice and …, 1999 - Wiley Online Library

Checkpointing systems are a convenient way for users to make their programs fault‐tolerant
by intermittently saving program state to disk and restoring that state following a failure. The …

Cited by 147 · Related articles · All 6 versions

[PDF] osti.gov

libhashckpt: hash-based incremental checkpointing using gpu's

KB Ferreira, R Riesen, R Brightwell, P Bridges… - European MPI Users' …, 2011 - Springer

Concern is beginning to grow in the high-performance computing (HPC) community
regarding the reliability guarantees of future large-scale systems. Disk-based coordinated …

Cited by 68 · Related articles · All 11 versions

[PDF] uniroma2.it

Current practice and a direction forward in checkpoint/restart implementations for fault tolerance

JC Sancho, F Petrini, K Davis… - 19th IEEE …, 2005 - ieeexplore.ieee.org

Checkpoint/restart is a general idea for which particular implementations enable various
functionalities in computer systems, including process migration, gang scheduling …

Cited by 81 · Related articles · All 9 versions

[PDF] academia.edu

Lightweight memory checkpointing

D Vogt, C Giuffrida, H Bos… - 2015 45th Annual IEEE …, 2015 - ieeexplore.ieee.org

Memory checkpointing is a pivotal technique in systems reliability, with applications ranging
from crash recovery to replay debugging. Unfortunately, many traditional memory check …

Cited by 37 · Related articles · All 10 versions

[PDF] proquest.com

[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com

Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

Cited by 45 · Related articles · All 6 versions

[PDF] udc.es

Improving scalability of application-level checkpoint-recovery by reducing checkpoint sizes

I Cores, G Rodríguez, MJ Martín, P González… - New Generation …, 2013 - Springer

The execution times of large-scale parallel applications on nowadays multi/many-core
systems are usually longer than the mean time between failures. Therefore, parallel …

Cited by 34 · Related articles · All 9 versions

[PDF] 139.91.90.193

Speculative memory checkpointing

D Vogt, A Miraglia, G Portokalidis, H Bos… - Proceedings of the 16th …, 2015 - dl.acm.org

High-frequency memory checkpointing is an important technique in several application
domains, such as automatic error recovery (where frequent checkpoints allow the system to …

Cited by 24 · Related articles · All 15 versions

A survey of rollback-recovery protocols in message-passing systems