E Yao, M Chen, R Wang, W Zhang, G Tan - arXiv preprint arXiv:1106.4213, 2011 - arxiv.org
Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically …
F Shahzad, M Wittmann, M Kreutzer… - Parallel Processing …, 2013 - World Scientific
The road to exascale computing poses many challenges for the High Performance Computing (HPC) community. Each step on the exascale path is mainly the result of a higher …
As the number of processors in today's parallel systems continues to grow, the mean-time-to-failure of these systems is becoming significantly shorter than the execution time of many …
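The reasoning behind this trend can be made explicit with a standard back-of-the-envelope model (an assumption added here for illustration, not taken from the snippet above): if a machine consists of N nodes that fail independently with exponentially distributed lifetimes of mean MTTF_node, the time to the first failure anywhere in the system is again exponential, with

\[
\mathrm{MTTF}_{\mathrm{system}} = \frac{\mathrm{MTTF}_{\mathrm{node}}}{N},
\]

so, for example, 100,000 nodes with a 5-year node MTTF (about 43,800 hours) yield a system MTTF of roughly 0.44 hours, i.e. a failure somewhere in the machine about every 26 minutes, far below the run time of many large-scale applications.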
High Performance Computing (HPC) systems represent the peak of modern computational capability. As ever-increasing demands for computational power have fuelled the demand …
K Mohror, A Moody… - IEEE/IFIP International …, 2012 - ieeexplore.ieee.org
Applications running on today's supercomputers tolerate failures by periodically saving their state in checkpoint files on stable storage, such as a parallel file system. Although this …
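As context for how such checkpoint files are typically produced, the sketch below shows the generic application-level pattern the snippet describes: periodically serialising the program state to a file on stable storage and resuming from it after a restart. It is an illustrative assumption, not the scheme evaluated in the cited paper; the file name state.ckpt, the interval, and the state layout are invented for the example.

/* Minimal application-level checkpoint/restart sketch (illustrative only).
 * The state layout, file name, and interval are assumptions, not taken
 * from any of the cited papers. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N           1000000      /* problem size                          */
#define STEPS       1000         /* total iterations                      */
#define CKPT_EVERY  100          /* checkpoint interval (iterations)      */
#define CKPT_FILE   "state.ckpt" /* checkpoint file on stable storage     */

/* Write the iteration counter and data to stable storage. */
static void write_checkpoint(int step, const double *data) {
    FILE *f = fopen(CKPT_FILE ".tmp", "wb");
    if (!f) { perror("checkpoint"); return; }
    fwrite(&step, sizeof step, 1, f);
    fwrite(data, sizeof *data, N, f);
    fclose(f);
    rename(CKPT_FILE ".tmp", CKPT_FILE);   /* replace the old checkpoint */
}

/* Try to resume from an existing checkpoint; return the step to restart at. */
static int read_checkpoint(double *data) {
    FILE *f = fopen(CKPT_FILE, "rb");
    int step = 0;
    if (!f) return 0;                      /* no checkpoint: start from scratch */
    if (fread(&step, sizeof step, 1, f) != 1 ||
        fread(data, sizeof *data, N, f) != (size_t)N) {
        step = 0;
        memset(data, 0, N * sizeof *data); /* unreadable checkpoint: start over */
    }
    fclose(f);
    return step;
}

int main(void) {
    double *data = calloc(N, sizeof *data);
    if (!data) return 1;

    int start = read_checkpoint(data);
    for (int step = start; step < STEPS; ++step) {
        for (int i = 0; i < N; ++i)        /* stand-in for the real computation */
            data[i] += 1.0;
        if ((step + 1) % CKPT_EVERY == 0)
            write_checkpoint(step + 1, data);
    }

    printf("done, data[0] = %f\n", data[0]);
    free(data);
    return 0;
}

Writing the new checkpoint to a temporary file and rename()-ing it over the previous one is a common precaution: a crash during the write then never destroys the last good checkpoint.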
JP Walters, V Chaudhary - International Conference on High-Performance …, 2007 - Springer
As computational clusters increase in size, their mean-time-to-failure reduces. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques …
Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores, but also tens of thousands of nodes. Even considering extremely …
The execution times of large-scale parallel applications on today's multi- and many-core systems are usually longer than the mean time between failures. Therefore, parallel …
JP Walters, V Chaudhary - IEEE Transactions on Parallel and …, 2008 - ieeexplore.ieee.org
As computational clusters increase in size, their mean time to failure reduces drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing …