Fail-stop failure algorithm-based fault tolerance for Cholesky decomposition

D Hakkarinen, P Wu, Z Chen - IEEE Transactions on Parallel …, 2014 - ieeexplore.ieee.org
Cholesky decomposition is a widely used algorithm to solve linear equations with symmetric
and positive definite coefficient matrix. With large matrices, this often will be performed on …

Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance

E Yao, J Zhang, M Chen, G Tan… - … International Journal of …, 2015 - journals.sagepub.com
Soft errors in scientific computing applications are becoming inevitable with the ever-
increasing system scale and execution time, and new technologies that feature increased …

Phoenix: A substrate for resilient distributed graph analytics

R Dathathri, G Gill, L Hoang, K Pingali - Proceedings of the Twenty …, 2019 - dl.acm.org
This paper presents Phoenix, a communication and synchronization substrate that
implements a novel protocol for recovering from fail-stop faults when executing graph …

NR-MPI: a Non-stop and Fault Resilient MPI

G Suo, Y Lu, X Liao, M Xie… - … Conference on Parallel …, 2013 - ieeexplore.ieee.org
Fault resilience has become a major issue for HPC systems, in particular in the perspective
of future E-scale systems, which will consist of millions of CPU cores and other components …

Self-checkpoint: An in-memory checkpoint method using less space and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen, W Zheng - ACM SIGPLAN Notices, 2017 - dl.acm.org
Fault tolerance is increasingly important in high performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

Efficient fault tolerance through dynamic node replacement

S Prabhakaran, M Neumann… - 2018 18th IEEE/ACM …, 2018 - ieeexplore.ieee.org
The mean time between failures of upcoming exascale systems is expected to be one hour
or less. To be able to successfully complete execution of applications in such scenarios …

An efficient in-memory checkpoint method and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Fault tolerance is increasingly important in high-performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

Identifying patterns towards algorithm based fault tolerance

U Kabir, D Goswami - 2015 International Conference on High …, 2015 - ieeexplore.ieee.org
Checkpoint and recovery cost imposed by coordinated checkpoint/restart (CCP/R) is a
crucial performance issue for high performance computing (HPC) applications. In …

A case study of designing efficient algorithm-based fault tolerant application for exascale parallelism

E Yao, R Wang, M Chen, G Tan… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org
Fault tolerance overhead of high performance computing (HPC) applications is becoming
critical to the efficient utilization of HPC systems at large scale. Today's HPC applications …

Fault tolerant computation of hyperbolic partial differential equations with the sparse grid combination technique

TB Harding - 2016 - core.ac.uk
As the computing power of supercomputers continues to increase exponentially, the mean
time between failures (MTBF) is decreasing. Checkpoint-restart has historically been the …