A survey of fault-tolerance and fault-recovery techniques in parallel systems

M Treaster - arXiv preprint cs/0501002, 2005 - arxiv.org
Supercomputing systems today often come in the form of large numbers of commodity
systems linked together into a computing cluster. These systems, like any distributed system …

Algorithm-based fault tolerance applied to high performance computing

G Bosilca, R Delmas, J Dongarra, J Langou - Journal of Parallel and …, 2009 - Elsevier
We present a new approach to fault tolerance for High Performance Computing systems. Our
approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance …

Using migratable objects to enhance fault tolerance schemes in supercomputers

E Meneses, X Ni, G Zheng… - IEEE transactions on …, 2014 - ieeexplore.ieee.org
Supercomputers have seen an exponential increase in their size in the last two decades.
Such a high growth rate is expected to take us to exascale in the timeframe 2018-2022. But …

[BOOK][B] Fault-tolerant parallel computation

PC Kanellakis, AA Shvartsman - 2013 - books.google.com
Fault-Tolerant Parallel Computation presents recent advances in algorithmic ways of
introducing fault-tolerance in multiprocessors under the constraint of preserving efficiency …

Parallel processing on networks of workstations: A fault-tolerant, high performance approach

P Dasgupta, ZM Kedem… - Proceedings of 15th …, 1995 - ieeexplore.ieee.org
One of the most sought-after software innovations of this decade is the construction of
systems using off-the-shelf workstations that actually deliver, and even surpass, the power …

Modeling and tolerating heterogeneous failures in large parallel systems

E Heien, D Kondo, A Gainaru, D LaPine… - Proceedings of 2011 …, 2011 - dl.acm.org
As supercomputers and clusters increase in size and complexity, system failures are
inevitable. Different hardware components (such as memory, disk, or network) of such …

Calypso: A novel software system for fault-tolerant parallel processing on distributed platforms

A Baratloo, P Dasgupta… - Proceedings of the Fourth …, 1995 - ieeexplore.ieee.org
The importance of adapting networks of workstations for use as parallel processing
platforms is well established. However, current solutions do not always address important …

Fault-tolerance considerations in large, multiple-processor systems

JG Kuhl, SM Reddy - Computer, 1986 - computer.org
… large, massively-parallel computing engines by interconnecting many conventional
processing elements to form an integrated supersystem [1-3]. The rapid expansion in very …

Super-scalable algorithms for computing on 100,000 processors

C Engelmann, A Geist - International Conference on Computational …, 2005 - Springer
In the next five years, the number of processors in high-end systems for scientific computing
is expected to rise to tens and even hundreds of thousands. For example, the IBM …

A fault tolerant protocol for massively parallel systems

S Chakravorty, LV Kalé - 18th International Parallel and …, 2004 - ieeexplore.ieee.org
As parallel machines grow larger, the mean time between failures
shrinks. With the planned machines of the near future, therefore, fault tolerance will become an …