Fail-stop failure algorithm-based fault tolerance for Cholesky decomposition

D Hakkarinen, P Wu, Z Chen - IEEE Transactions on Parallel …, 2014 - ieeexplore.ieee.org
Cholesky decomposition is a widely used algorithm to solve linear equations with a symmetric
and positive definite coefficient matrix. With large matrices, this often will be performed on …

Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance

E Yao, J Zhang, M Chen, G Tan… - … International Journal of …, 2015 - journals.sagepub.com
Soft errors in scientific computing applications are becoming inevitable with the ever-
increasing system scale and execution time, and new technologies that feature increased …

Phoenix: A substrate for resilient distributed graph analytics

R Dathathri, G Gill, L Hoang, K Pingali - Proceedings of the Twenty …, 2019 - dl.acm.org
This paper presents Phoenix, a communication and synchronization substrate that
implements a novel protocol for recovering from fail-stop faults when executing graph …

Self-checkpoint: An in-memory checkpoint method using less space and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen, W Zheng - ACM SIGPLAN Notices, 2017 - dl.acm.org
Fault tolerance is increasingly important in high performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

An efficient in-memory checkpoint method and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Fault tolerance is increasingly important in high-performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

Parallel reduction to Hessenberg form with algorithm-based fault tolerance

Y Jia, G Bosilca, P Luszczek, JJ Dongarra - Proceedings of the …, 2013 - dl.acm.org
This paper studies the resilience of a two-sided factorization and presents a generic
algorithm-based approach capable of making two-sided factorizations resilient. We establish …

A vision of post-exascale programming

JD Zhai, WG Chen - Frontiers of Information Technology & Electronic …, 2018 - Springer
Exascale systems have been under development for quite some time and will be available
for use in a few years. It is time to think about future post-exascale systems. There are many …

Identifying patterns towards algorithm based fault tolerance

U Kabir, D Goswami - 2015 International Conference on High …, 2015 - ieeexplore.ieee.org
Checkpoint and recovery cost imposed by coordinated checkpoint/restart (CCP/R) is a
crucial performance issue for high performance computing (HPC) applications. In …

Toward fault-tolerant parallel-in-time integration with PFASST

R Speck, D Ruprecht - Parallel Computing, 2017 - Elsevier
We introduce and analyze different strategies for the parallel-in-time integration method
PFASST to recover from hard faults and subsequent data loss. Since PFASST stores …

Fault tolerant computation of hyperbolic partial differential equations with the sparse grid combination technique

TB Harding - 2016 - core.ac.uk
As the computing power of supercomputers continues to increase exponentially, the mean
time between failures (MTBF) is decreasing. Checkpoint-restart has historically been the …