As HPC systems grow in scale, the frequency of system-wide failures is expected to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault …
J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org
Today's largest High Performance Computing (HPC) systems exceed one Petaflop/s (10^15 floating point operations per second) and exascale systems are projected within seven …
This paper revisits the failure temporal independence hypothesis, which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is …
The increasingly large ensemble size of modern High-Performance Computing (HPC) systems has drastically increased the likelihood of failures. Performance under failures and …
For future parallel-computing systems with as few as twenty-thousand nodes, we propose redundant computing to reduce the number of application interrupts. The frequency of faults …
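As a rough reading of the redundant-computing claim above (this sketch is not taken from the abstract itself; the symbols $p$ and $N$ and the independent-failure assumption are introduced here purely for illustration), duplicating every rank means an application interrupt requires both replicas of the same rank to fail:

$$
P_{\text{interrupt}} \;=\; 1 - \left(1 - p^{2}\right)^{N} \;\approx\; N p^{2},
$$

compared with $1 - (1 - p)^{N} \approx N p$ for an unreplicated run, where $p$ is the per-node failure probability over the job and $N$ is the number of logical ranks. Under these idealized assumptions, replication trades a doubling of resources for a quadratic reduction in the per-node failure term, which is why it can reduce the number of application interrupts at scale.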
S Rani, C Leangsuksun, A Tikotekar… - High Availability and …, 2006 - Citeseer
Application outages due to node failures are a common problem in high-performance computing. Reliability becomes a major issue, especially for long-running jobs, as the …
An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend …
The complexity and scale of next-generation HPC systems pose significant challenges for fault resilience, such that contemporary checkpoint/restart (C/R) methods that address …
N Raju, YL Gottumukkala, CB Leangsuksun… - Proceedings of the High …, 2006 - Citeseer
Resource failures and downtimes have become a growing concern for large-scale computational platforms, as they tend to have an adverse effect on the performance of the …
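Several of the snippets above weigh checkpoint/restart cost against the failure rate. As background (Young's first-order approximation, stated here as context rather than as a result from any of the listed papers, with symbols chosen for this note), the checkpoint interval that roughly minimizes expected overhead is

$$
\tau_{\text{opt}} \;\approx\; \sqrt{2\,\delta\,M},
$$

where $\delta$ is the time to write one checkpoint and $M$ is the system mean time between failures. As $M$ shrinks with increasing node count, $\tau_{\text{opt}}$ shrinks and the fraction of time spent checkpointing grows, which is the scaling concern these abstracts share.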