A systematic survey on fault-tolerant solutions for distributed data analytics: Taxonomy, comparison, and future directions

S Isukapalli, SN Srirama - Computer Science Review, 2024 - Elsevier
Fault tolerance is becoming increasingly important for upcoming exascale systems,
supporting distributed data processing, due to the expected decrease in the Mean Time …

Fault‐tolerance approaches for distributed and cloud computing environments: A systematic review, taxonomy and future directions

M Kirti, AK Maurya, RS Yadav - Concurrency and Computation …, 2024 - Wiley Online Library
Fault tolerance is crucial in ensuring smooth working of distributed and cloud computing. It is
challenging to implement because of the constantly changing infrastructure and complex …

A highly reliable metadata service for large-scale distributed file systems

J Zhou, Y Chen, W Wang, S He… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Many massive data processing applications nowadays often need long, continuous, and
uninterrupted data accesses. Distributed file systems are used as the back-end storage to …

Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Rapid Recover Map Reduce (RR-MR): Boosting failure recovery in Big Data applications

S Chorey, N Sahu - Journal of Integrated Science and …, 2024 - pubs.thesciencein.org
The rapid growth of Big Data applications has brought forth unprecedented opportunities for
insights and innovation, but it has also exposed the inherent vulnerabilities of data …

Adaptive erasure coded fault tolerant linear system solver

X Kang, DF Gleich, A Sameh, A Grama - ACM Transactions on Parallel …, 2021 - dl.acm.org
As parallel and distributed systems scale, fault tolerance is an increasingly important
problem—particularly on systems with limited I/O capacity and bandwidth. Erasure coded …

Random Pattern Generation and Redundancy Analysis in Memories

KB Anudeep, DJ Jagannath, S Radha… - 2022 IEEE 11th …, 2022 - ieeexplore.ieee.org
Designing a memory is a challenge by itself as the storage and fault occurrence is quite
common in memories. Hence, fault occurrence and fault detection, corrections are major …

Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints

A Wong, E Heymann, D Rexachs… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Compute node failures are becoming a normal event for many long-running and scalable
MPI applications. Keeping within the MPI standards and applying some of the methods …

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations (Dagstuhl Seminar 20101)

L Giraud, U Rüde, L Stals - 2020 - drops.dagstuhl.de
This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by …

Lineage chain mark fault-tolerant method for micro-batching monitoring data in distribution power network

Z Qu, H Wang, X Peng, Q Wang - IEEE Access, 2019 - ieeexplore.ieee.org
Aiming at the problem of lacking efficient distributed fault tolerant mechanism for data
explosion in the distributed distribution power automation system, based on the record-level …