Volpexmpi: An MPI library for execution of parallel applications on volatile nodes

D Fiala, F Mueller, C Engelmann… - SC'12: Proceedings …, 2012 - ieeexplore.ieee.org

Faults have become the norm rather than the exception for high-end computing clusters.
Exacerbating this situation, some of these faults remain undetected, manifesting themselves …

Cited by 392 · Related articles · All 31 versions

[PDF] ncsu.edu

Combining partial redundancy and checkpointing for HPC

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org

Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

Cited by 206 · Related articles · All 20 versions

[PDF] christian-engelmann.info

[PDF] Redundant execution of HPC applications with MR-MPI

C Engelmann, S Böhm - Proceedings of the 10th IASTED …, 2011 - christian-engelmann.info

This paper presents a modular-redundant Message Passing Interface (MPI) solution, MR-MPI,
for transparently executing high-performance computing (HPC) applications in a …

Cited by 82 · Related articles · All 16 versions

[PDF] researchgate.net

Reachability testing: An approach to testing concurrent software

GH Hwang, KC Tai, TL Huang - International Journal of Software …, 1995 - World Scientific

Concurrent programs are more difficult to test than sequential programs because of
non-deterministic behavior. An execution of a concurrent program non-deterministically …

Cited by 131 · Related articles · All 13 versions

[PDF] academia.edu

[PDF] Process Migration for Resilient Applications

K McGill, S Taylor - Dartmouth College, 2011 - academia.edu

The notion of resiliency is concerned with constructing mission-critical distributed
applications that are able to operate through a wide variety of failures, errors, and malicious …

Cited by 56 · Related articles · All 2 versions

[PDF] ncsu.edu

Proactive process-level live migration and back migration in HPC environments

C Wang, F Mueller, C Engelmann, SL Scott - Journal of Parallel and …, 2012 - Elsevier

As the number of nodes in high-performance computing environments keeps increasing,
faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to …

Cited by 64 · Related articles · All 7 versions

[PDF] researchgate.net

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer

With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Cited by 8 · Related articles · All 4 versions

[PDF] proquest.com

[BOOK] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com

Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

Cited by 45 · Related articles · All 6 versions

[PDF] acm.org

Replication is more efficient than you think

A Benoit, T Herault, VL Fèvre, Y Robert - Proceedings of the International …, 2019 - dl.acm.org

This paper revisits replication coupled with checkpointing for fail-stop errors. Replication
enables the application to survive many fail-stop errors, thereby allowing for longer …

Cited by 18 · Related articles · All 17 versions

[PDF] iisc.ac.in

Fault tolerance on large scale systems using adaptive process replication

C George, S Vadhiyar - IEEE Transactions on Computers, 2014 - ieeexplore.ieee.org

Exascale systems of the future are predicted to have mean time between failures (MTBF) of
less than one hour. At such low MTBFs, employing periodic checkpointing alone will result in …

Cited by 28 · Related articles · All 9 versions
