Partial redundancy in HPC systems with non-uniform node reliabilities

Z Hussain, T Znati, R Melhem - SC18: International Conference …, 2018 - ieeexplore.ieee.org
We study the usefulness of partial redundancy in HPC message passing systems where
individual node failure distributions are not identical. Prior research works on fault tolerance …

Color: Co-located rescuers for fault tolerance in HPC systems

Z Hussain, X Cui, T Znati… - 2018 IEEE 24th …, 2018 - ieeexplore.ieee.org
With the increase in scale of HPC systems, the frequency of system wide failures is expected
to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault …

Combining partial redundancy and checkpointing for HPC

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org
Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

Assuming failure independence: are we right to be wrong?

G Aupy, Y Robert, F Vivien - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
This paper revisits the failure temporal independence hypothesis which is omnipresent in
the analysis of resilience methods for HPC. We explain why a previous approach is …

Optimizing HPC fault-tolerant environment: An analytical approach

H Jin, Y Chen, H Zhu, XH Sun - 2010 39th International …, 2010 - ieeexplore.ieee.org
The increasingly large ensemble size of modern High-Performance Computing (HPC)
systems has drastically increased the possibility of failures. Performance under failures and …

See applications run and throughput jump: The case for redundant computing in HPC

R Riesen, K Ferreira, J Stearley - … International Conference on …, 2010 - ieeexplore.ieee.org
For future parallel-computing systems with as few as twenty-thousand nodes we propose
redundant computing to reduce the number of application interrupts. The frequency of faults …

Toward efficient failure detection and recovery in HPC

S Rani, C Leangsuksun, A Tikotekar… - High Availability and …, 2006 - Citeseer
Application outages due to node failures are common problems in high performance
computing. Reliability becomes a major issue, especially for long running jobs, as the …

Assessing energy efficiency of fault tolerance protocols for HPC systems

E Meneses, O Sarood, LV Kalé - 2012 IEEE 24th International …, 2012 - ieeexplore.ieee.org
An exascale machine is expected to be delivered in the time frame 2018-2020. Such a
machine will be able to tackle some of the hardest computational problems and to extend …

Dino: Divergent node cloning for sustained redundancy in HPC

A Rezaei, F Mueller, P Hargrove, E Roman - Journal of Parallel and …, 2017 - Elsevier
Complexity and scale of next generation HPC systems pose significant challenges in fault
resilience methods such that contemporary checkpoint/restart (C/R) methods that address …

Reliability Analysis in HPC clusters

N Raju, YL Gottumukkala, CB Leangsuksun… - Proceedings of the High …, 2006 - Citeseer
Resource failures and down times have become a growing concern for large-scale
computational platforms, as they tend to have an adverse effect on the performance of the …