Designing and modelling selective replication for fault-tolerant HPC applications

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com

This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Cited by 9 - Related articles - All 22 versions

Multiple fault-tolerance mechanisms in cloud systems: A systematic review

P Marcotte, F Grégoire, F Petrillo - 2019 IEEE International …, 2019 - ieeexplore.ieee.org

Cloud systems are progressively taking over today's software market. These typically require
constant operations with a minimum of failure. Multiple fault-tolerance mechanisms have …

Cited by 11 - Related articles - All 4 versions

Canary: fault-tolerant FaaS for stateful time-sensitive applications

M Arif, K Assogba, MM Rafique - … : International Conference for …, 2022 - ieeexplore.ieee.org

Function-as-a-Service (FaaS) platforms have recently gained rapid popularity. Many stateful
applications have been migrated to FaaS platforms due to their ease of deployment …

Cited by 6 - Related articles - All 5 versions

Checkpointing workflows for fail-stop errors

L Han, LC Canon, H Casanova… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org

We consider the problem of orchestrating the execution of workflow applications structured
as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail …

Cited by 27 - Related articles - All 22 versions

Partial redundancy in HPC systems with non-uniform node reliabilities

Z Hussain, T Znati, R Melhem - SC18: International Conference …, 2018 - ieeexplore.ieee.org

We study the usefulness of partial redundancy in HPC message passing systems where
individual node failure distributions are not identical. Prior research works on fault tolerance …

Cited by 22 - Related articles - All 7 versions

Efficient selective replication of critical code regions for SDC mitigation leveraging redundant multithreading

S Arslan, O Unsal - The Journal of Supercomputing, 2021 - Springer

Redundant multithreading (RMT) is an effective reliability solution that provides thread-level
replication; however, it imposes additional overheads in terms of performance loss or energy …

Cited by 8 - Related articles - All 7 versions

Enabling resilience in asynchronous many-task programming models

SR Paul, A Hayashi, N Slattengren, H Kolla… - Euro-Par 2019: Parallel …, 2019 - Springer

Resilience is an imminent issue for next-generation platforms due to projected increases in
soft/transient failures as part of the inherent trade-offs among performance, energy, and …

Cited by 13 - Related articles - All 5 versions

MACORD: online adaptive machine learning framework for silent error detection

O Subasi, S Di, P Balaprakash, O Unsal… - 2017 IEEE …, 2017 - ieeexplore.ieee.org

Future high-performance computing (HPC) systems with ever-increasing resource capacity
(such as compute cores, memory and storage) may significantly increase the risks on …

Cited by 16 - Related articles - All 5 versions

Task-level checkpointing for nested fork-join programs using work stealing

L Reitz, C Fohry - European Conference on Parallel Processing, 2023 - Springer

Recent Exascale supercomputers consist of millions of processing units, and this number is
still growing. Therefore, hardware failures, such as permanent node failures, become …

Cited by 1 - Related articles - All 2 versions

teaMPI—replication-based resilience without the (performance) pain

P Samfass, T Weinzierl, B Hazelwood… - … Conference, ISC High …, 2020 - Springer

In an era where we can not afford to checkpoint frequently, replication is a generic way
forward to construct numerical simulations that can continue to run even if hardware parts …

Cited by 9 - Related articles - All 11 versions
