An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend …
With the increase in scale of HPC systems, the frequency of system wide failures is expected to increase. The performance of Coordinated Checkpoint/Restart (C/R), the traditional fault …
The high failure rate expected for future supercomputers requires the design of new fault tolerant solutions. Most checkpointing protocols are designed to work with any message …
F Shahzad, M Wittmann, T Zeiser, G Wellein - Recent Advances in the …, 2012 - Springer
Abstract Checkpoint/restart (C/R) is a classical approach to introduce fault tolerance in large HPC applications. Although it is relatively easy as compared to other fault tolerance …
Failures are increasingly threatening the efficiency of HPC systems, and current projections of Exascale platforms indicate that rollback recovery, the most convenient method for …
N El-Sayed, B Schroeder - 2014 IEEE International Conference …, 2014 - ieeexplore.ieee.org
Efficient use of high-performance computing (HPC) installations critically relies on effective methods for fault tolerance. The most commonly used method is checkpoint/restart, where …
In order to efficiently use the future generations of supercomputers, fault tolerance and power consumption are two of the prime challenges anticipated by the High Performance …
IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer
Abstract In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take …
This paper revisits the failure 1 temporal independence hypothesis which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is …