VolpexMPI: An MPI library for execution of parallel applications on volatile nodes

T LeBlanc, R Anand, E Gabriel, J Subhlok - Recent Advances in Parallel …, 2009 - Springer
The objective of this research is to convert ordinary idle PCs into virtual clusters for
executing parallel applications. The paper introduces VolpexMPI that is designed to enable …

[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com
Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

An efficient in-memory checkpoint method and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Fault tolerance is increasingly important in high-performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

Self-checkpoint: An in-memory checkpoint method using less space and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen, W Zheng - ACM SIGPLAN Notices, 2017 - dl.acm.org
Fault tolerance is increasingly important in high performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

[PDF][PDF] A composable runtime recovery policy framework supporting resilient HPC applications

J Hursey, A Lumsdaine - Indiana University, Bloomington …, 2010 - legacy.cs.indiana.edu
An HPC application must be resilient to sustain itself in the event of process loss due to the
high probability of hardware failure on modern HPC systems. These applications rely on …

Providing non-stop service for message-passing based parallel applications with RADIC

G Santos, A Duarte, D Rexachs, E Luque - Euro-Par 2008–Parallel …, 2008 - Springer
The current supercomputers are almost achieving the petaflop level. These machines
present a high number of interruptions in a relatively short time interval. Fault tolerance and …

[BOOK][B] RADIC: a powerful fault-tolerant architecture

AA Duarte - 2007 - ddd.uab.cat
Fault tolerance has become an important requirement for computer engineers
and software developers, because the occurrence of faults …

Toward a scalable, transactional, fault-tolerant message passing interface for petascale and exascale machines

A Hassani - 2016 - search.proquest.com
Increases in the scale of computing machines directly correlate with the rate of failures. High
Performance Computing (HPC) applications provide fault-tolerance through redundancy in …

A robust and efficient message passing library for volunteer computing environments

R Anand, T LeBlanc, E Gabriel, J Subhlok - Journal of Grid Computing, 2011 - Springer
The objective of this research is to convert ordinary idle PCs into virtual clusters for
executing parallel applications. The paper presents VolpexMPI that is designed to enable …

High availability for parallel computers

D Rexachs del Rosario… - Journal of Computer …, 2010 - sedici.unlp.edu.ar
Fault tolerance has become an important issue for parallel applications in the last few years.
The parallel systems' users want them to be reliable considering two main dimensions …