Compressed differences: An algorithm for fast incremental checkpointing

EN Elnozahy, L Alvisi, YM Wang… - ACM Computing Surveys …, 2002 - dl.acm.org

This survey covers rollback-recovery techniques that do not require special language
constructs. In the first part of the survey we classify rollback-recovery protocols into …

Cited by 2591 Related articles All 52 versions

[PDF] arxiv.org

DMTCP: Transparent checkpointing for cluster computations and the desktop

J Ansel, K Arya, G Cooperman - 2009 IEEE international …, 2009 - ieeexplore.ieee.org

DMTCP (distributed multithreaded checkpointing) is a transparent user-level checkpointing
package for distributed applications. Checkpointing and restart is demonstrated for a wide …

Cited by 469 Related articles All 24 versions

[PDF] cmu.edu

PLFS: A checkpoint filesystem for parallel applications

J Bent, G Gibson, G Grider, B McClelland… - Proceedings of the …, 2009 - dl.acm.org

Parallel applications running across thousands of processors must protect themselves from
inevitable system failures. Many applications insulate themselves from failures by …

Cited by 492 Related articles All 23 versions

[PDF] utk.edu

Diskless checkpointing

JS Plank, K Li, MA Puening - IEEE Transactions on parallel and …, 1998 - ieeexplore.ieee.org

Diskless Checkpointing is a technique for checkpointing the state of a long-running
computation on a distributed system without relying on stable storage. As such, it eliminates …

Cited by 538 Related articles All 13 versions

[PDF] academia.edu

[PDF] Fault tolerance in grid computing: state of the art and open issues

R Garg, AK Singh - International Journal of Computer Science and …, 2011 - academia.edu

Fault tolerance is an important property for large scale computational grid systems, where
geographically distributed nodes co-operate to execute a task. In order to achieve high level …

Cited by 70 Related articles All 5 versions

[PDF] psu.edu

Adaptive incremental checkpointing for massively parallel systems

S Agarwal, R Garg, MS Gupta, JE Moreira - Proceedings of the 18th …, 2004 - dl.acm.org

Given the scale of massively parallel systems, occurrence of faults is no longer an exception
but a regular event. Periodic checkpointing is becoming increasingly important in these …

Cited by 258 Related articles All 7 versions

Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery

EN Elnozahy, JS Plank - IEEE Transactions on Dependable …, 2004 - ieeexplore.ieee.org

Over the past two decades, rollback-recovery via checkpoint-restart has been used with
reasonable success for long-running applications, such as scientific workloads that take …

Cited by 297 Related articles All 8 versions

[PDF] github.io

Exploration of lossy compression for application-level checkpoint/restart

N Sasaki, K Sato, T Endo… - 2015 IEEE international …, 2015 - ieeexplore.ieee.org

The scale of high performance computing (HPC) systems is exponentially growing,
potentially causing prohibitive shrinkage of mean time between failures (MTBF) while the …

Cited by 122 Related articles All 8 versions

[PDF] wiley.com

MCREngine: A scalable checkpointing system using data-aware aggregation and compression

TZ Islam, K Mohror, S Bagchi, A Moody… - SC'12: Proceedings …, 2012 - ieeexplore.ieee.org

High performance computing (HPC) systems use checkpoint-restart to tolerate failures.
Typically, applications store their states in checkpoints on a parallel file system (PFS). As …

Cited by 139 Related articles All 24 versions

[PS] usenix.org

Optimistic deltas for WWW latency reduction

G Banga, F Douglis, M Rabinovich - Proc. 1997 USENIX Technical …, 1997 - usenix.org

When a machine is connected to the Internet via a slow network, such as a 28.8 Kbps
modem, the cumulative latency to communicate over the Internet to World Wide Web servers …

Cited by 179 Related articles All 12 versions
