A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer
In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …

DMTCP: Transparent checkpointing for cluster computations and the desktop

J Ansel, K Arya, G Cooperman - 2009 IEEE international …, 2009 - ieeexplore.ieee.org
DMTCP (distributed multithreaded checkpointing) is a transparent user-level checkpointing
package for distributed applications. Checkpointing and restart is demonstrated for a wide …

Performance evaluation of checkpoint/restart techniques: For MPI applications on Amazon cloud

BA Azeem, M Helal - 2014 9th International Conference on …, 2014 - ieeexplore.ieee.org
Distributed applications running on a large cluster environment, such as the cloud instances,
will have shorter execution time. However, the application might suffer from sudden …

A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

IP Egwutuoha - 2013 - ses.library.usyd.edu.au
High Performance Computing (HPC) systems have been widely used by scientists and
researchers in both industry and university laboratories to solve advanced computation …

A rule-based domain specific language for fault management

O Kaya, S Hashemikhabir, C Togay… - Journal of Integrated …, 2010 - content.iospress.com
In this article, we propose a domain specific language for the "fault management for mission
critical systems" domain that also supports rule-based operation. Variability management for …

A Rule-based domain specific language for fault management

Ö Kaya - 2014 - open.metu.edu.tr
A fault management framework has been developed where a rule-based event processing
language is also developed that provides improvement to the existing approaches in terms …

MidHPC: Um suporte para a execução transparente de aplicações em grids computacionais

JA Andrade Filho - 2008 - teses.usp.br
Research on high-performance parallel and distributed systems is limited
with respect to the analysis, design, implementation, and automatic, transparent execution …

Performance Evaluation of Checkpoint/Restart Techniques

BA Azeem, M Helal - arXiv preprint arXiv:2311.17545, 2023 - arxiv.org
Distributed applications running on a large cluster environment, such as the cloud instances,
will have shorter execution time. However, the application might suffer from sudden …

Enabling sender-initiated distributed applications and checkpointing in content centric networks

N Mohan, P Singh - 2015 - repository.iiitd.edu.in
Content Centric Network is a proposed future networking paradigm where data is the central
entity for communication and the correspondence model follows two-step approach for data …

Be Kind, Rewind

I Ljubuncic, A Rozenfeld, A Goldis, R Giri - ieee-hpec.org
Intel's chip design runs in a large-scale, globally distributed environment with 600,000 cores.
In the current semiconductor market scenario, a combination of factors such as time to …