Building algorithmically nonstop fault tolerant MPI programs

D Hakkarinen, P Wu, Z Chen - IEEE Transactions on Parallel …, 2014 - ieeexplore.ieee.org

Cholesky decomposition is a widely used algorithm to solve linear equations with symmetric
and positive definite coefficient matrix. With large matrices, this often will be performed on …

Cited by 50 · Related articles · All 5 versions

Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance

E Yao, J Zhang, M Chen, G Tan… - … International Journal of …, 2015 - journals.sagepub.com

Soft errors in scientific computing applications are becoming inevitable with the
ever-increasing system scale and execution time, and new technologies that feature increased …

Cited by 23 · Related articles · All 5 versions

Phoenix: A substrate for resilient distributed graph analytics

R Dathathri, G Gill, L Hoang, K Pingali - Proceedings of the Twenty …, 2019 - dl.acm.org

This paper presents Phoenix, a communication and synchronization substrate that
implements a novel protocol for recovering from fail-stop faults when executing graph …

Cited by 15 · Related articles · All 5 versions

NR-MPI: a Non-stop and Fault Resilient MPI

G Suo, Y Lu, X Liao, M Xie… - … Conference on Parallel …, 2013 - ieeexplore.ieee.org

Fault resilience has become a major issue for HPC systems, in particular in the perspective
of future E-scale systems, which will consist of millions of CPU cores and other components …

Cited by 20 · Related articles · All 5 versions

Self-checkpoint: An in-memory checkpoint method using less space and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen, W Zheng - ACM SIGPLAN Notices, 2017 - dl.acm.org

Fault tolerance is increasingly important in high performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

Cited by 15 · Related articles · All 5 versions

Efficient fault tolerance through dynamic node replacement

S Prabhakaran, M Neumann… - 2018 18th IEEE/ACM …, 2018 - ieeexplore.ieee.org

The mean time between failures of upcoming exascale systems is expected to be one hour
or less. To be able to successfully complete execution of applications in such scenarios …

Cited by 12 · Related articles · All 5 versions

An efficient in-memory checkpoint method and its practice on fault-tolerant HPL

X Tang, J Zhai, B Yu, W Chen… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org

Fault tolerance is increasingly important in high-performance computing due to the
substantial growth of system scale and decreasing system reliability. In-memory/diskless …

Cited by 13 · Related articles · All 2 versions

Identifying patterns towards algorithm based fault tolerance

U Kabir, D Goswami - 2015 International Conference on High …, 2015 - ieeexplore.ieee.org

Checkpoint and recovery cost imposed by coordinated checkpoint/restart (CCP/R) is a
crucial performance issue for high performance computing (HPC) applications. In …

Cited by 12 · Related articles · All 3 versions

A case study of designing efficient algorithm-based fault tolerant application for exascale parallelism

E Yao, R Wang, M Chen, G Tan… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org

Fault tolerance overhead of high performance computing (HPC) applications is becoming
critical to the efficient utilization of HPC systems at large scale. Today's HPC applications …

Cited by 15 · Related articles · All 6 versions

[PDF] Fault tolerant computation of hyperbolic partial differential equations with the sparse grid combination technique

TB Harding - 2016 - core.ac.uk

As the computing power of supercomputers continues to increase exponentially the mean
time between failures (MTBF) is decreasing. Checkpoint-restart has historically been the …

Cited by 9 · Related articles · All 2 versions
