Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Designing and modelling selective replication for fault-tolerant HPC applications

O Subasi, G Yalcin, F Zyulkyarov… - 2017 17th IEEE/ACM …, 2017 - ieeexplore.ieee.org
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for
High Performance Computing (HPC) applications. There are studies that address fail-stop …

Partial redundancy in HPC systems with non-uniform node reliabilities

Z Hussain, T Znati, R Melhem - SC18: International Conference …, 2018 - ieeexplore.ieee.org
We study the usefulness of partial redundancy in HPC message passing systems where
individual node failure distributions are not identical. Prior research works on fault tolerance …

MACORD: online adaptive machine learning framework for silent error detection

O Subasi, S Di, P Balaprakash, O Unsal… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
Future high-performance computing (HPC) systems with ever-increasing resource capacity
(such as compute cores, memory and storage) may significantly increase the risks on …

Automatic risk-based selective redundancy for fault-tolerant task-parallel HPC applications

O Subasi, O Unsal, S Krishnamoorthy - Proceedings of the Third …, 2017 - dl.acm.org
Silent data corruption (SDC) and fail-stop errors are the most hazardous error types in high-
performance computing (HPC) systems. In this study, we present an automatic, efficient and …

Towards distributed software resilience in asynchronous many-task programming models

N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Implementing software resiliency in HPX for extreme scale computing

N Gupta, JR Mayo, AS Lemoine, H Kaiser - arXiv preprint arXiv …, 2020 - arxiv.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Resiliency in Numerical Algorithm Design for Extreme Scale Simulations (Dagstuhl Seminar 20101)

L Giraud, U Rüde, L Stals - 2020 - drops.dagstuhl.de
This work is based on the seminar titled "Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations" held March 1-6, 2020 at Schloss Dagstuhl, that was attended by …

[BOOK][B] Heterogeneity aware fault tolerance for extreme scale computing

Z Hussain - 2020 - search.proquest.com
Abstract Upcoming Extreme Scale, or Exascale, Computing Systems are expected to deliver
a peak performance of at least 10^18 floating point operations per second (FLOPS), primarily …

A TDMA based scheduling scheme in 802.11b WLANs with access point

O Abu-Sharkh, AH Tewfik - … and Computing (ITCC'05)-Volume II, 2005 - ieeexplore.ieee.org
This paper introduces a new scheduling scheme that provides fair access to all stations in
802.11b WLANs. The scheme divides the transmission opportunities between wireless …