S Agarwal, R Garg, MS Gupta, JE Moreira - Proceedings of the 18th …, 2004 - dl.acm.org
Given the scale of massively parallel systems, occurrence of faults is no longer an exception but a regular event. Periodic checkpointing is becoming increasingly important in these …
A Havemose, CP Ngan - US Patent 7,681,075, 2010 - Google Patents
US7681075B2 - Method and system for providing high availability to distributed computer applications - Google Patents
JS Plank, Y Chen, K Li, M Beck… - Software: practice and …, 1999 - Wiley Online Library
Checkpointing systems are a convenient way for users to make their programs fault‐tolerant by intermittently saving program state to disk and restoring that state following a failure. The …
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated …
JC Sancho, F Petrini, K Davis… - 19th IEEE …, 2005 - ieeexplore.ieee.org
Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling …
Memory checkpointing is a pivotal technique in systems reliability, with applications ranging from crash recovery to replay debugging. Unfortunately, many traditional memory check …
Scientists use advanced computing techniques to assist in answering the complex questions at the forefront of discovery. The High Performance Computing (HPC) scientific applications …
The execution times of large-scale parallel applications on today's multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel …
High-frequency memory checkpointing is an important technique in several application domains, such as automatic error recovery (where frequent checkpoints allow the system to …