Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing

D Göddeke, M Altenbernd, D Ribbrock - Parallel Computing, 2015 - Elsevier
We analyse novel fault tolerance schemes for data loss in multigrid solvers, which
essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To …

Soft fault detection and correction for multigrid

M Altenbernd, D Göddeke - The International Journal of High …, 2018 - journals.sagepub.com
We introduce a novel algorithm-based fault-tolerance scheme to detect and repair soft
transient faults (silent data corruption, bitflips) in multigrid solvers: by applying the full …

Algorithm-based checkpoint-recovery for the conjugate gradient method

C Pachajoa, C Pacher, M Levonyak… - Proceedings of the 49th …, 2020 - dl.acm.org
As computers reach exascale and beyond, the incidence of faults will increase. Solutions to
this problem are an active research topic. We focus on strategies to make the preconditioned …

Subspace correction methods in algebraic multi-level frames

P Zaspel - Linear Algebra and its Applications, 2016 - Elsevier
This study aims at introducing new algebraic multi-level solution techniques for linear
systems with M-matrices. Previous optimal geometric constructions by multi-level generating …

The Resiliency of Multilevel Methods on Next Generation Computing Platforms: Probabilistic Model and Its Analysis

CA Glusa, M Ainsworth - 2018 - osti.gov
The reduced reliability of next generation exascale systems means that the resiliency
properties of a numerical algorithm will become an important factor in both the choice of …