Comparison of the existing checkpoint systems

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer

Abstract: In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …

Cited by 346 · Related articles · All 12 versions

DMTCP: Transparent checkpointing for cluster computations and the desktop

J Ansel, K Arya, G Cooperman - 2009 IEEE international …, 2009 - ieeexplore.ieee.org

DMTCP (distributed multithreaded checkpointing) is a transparent user-level checkpointing
package for distributed applications. Checkpointing and restart is demonstrated for a wide …

Cited by 469 · Related articles · All 24 versions

Performance evaluation of checkpoint/restart techniques: For MPI applications on Amazon cloud

BA Azeem, M Helal - 2014 9th International Conference on …, 2014 - ieeexplore.ieee.org

Distributed applications running on a large cluster environment, such as the cloud instances,
will have shorter execution time. However, the application might suffer from sudden …

Cited by 10 · Related articles · All 2 versions

A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

IP Egwutuoha - 2013 - ses.library.usyd.edu.au

High Performance Computing (HPC) systems have been widely used by scientists and
researchers in both industry and university laboratories to solve advanced computation …

Cited by 3 · Related articles · All 2 versions

A rule-based domain specific language for fault management

O Kaya, S Hashemikhabir, C Togay… - Journal of Integrated …, 2010 - content.iospress.com

In this article, we propose a domain specific language for the "fault management for mission
critical systems" domain that also supports rule-based operation. Variability management for …

Cited by 4 · Related articles · All 5 versions

A Rule-based domain specific language for fault management

Ö Kaya - 2014 - open.metu.edu.tr

A fault management framework has been developed where a rule-based event processing
language is also developed that provides improvement to the existing approaches in terms …

Cited by 3 · Related articles · All 2 versions

MidHPC: Um suporte para a execução transparente de aplicações em grids computacionais [MidHPC: Support for the transparent execution of applications on computational grids]

JA Andrade Filho - 2008 - teses.usp.br

Research on high-performance parallel and distributed systems has limitations
with regard to the analysis, design, implementation, and automatic and transparent execution …

Cited by 1 · Related articles · All 4 versions

Performance Evaluation of Checkpoint/Restart Techniques

BA Azeem, M Helal - arXiv preprint arXiv:2311.17545, 2023 - arxiv.org

Distributed applications running on a large cluster environment, such as the cloud instances,
will have shorter execution time. However, the application might suffer from sudden …

Cited by 1 · Related articles · All 2 versions

Enabling sender-initiated distributed applications and checkpointing in content centric networks

N Mohan, P Singh - 2015 - repository.iiitd.edu.in

Content Centric Network is a proposed future networking paradigm where data is the central
entity for communication and the correspondence model follows two-step approach for data …

[PDF] Be Kind, Rewind

I Ljubuncic, A Rozenfeld, A Goldis, R Giri - ieee-hpec.org

Intel's chip design runs in a large-scale globally distributed environment with 600,000 cores.
In the current semiconductor market scenario, a combination of factors such as time to …
