Fault-tolerant protocol for hybrid task-parallel message-passing applications

T Martsinkevich, O Subasi, O Unsal… - 2015 IEEE …, 2015 - ieeexplore.ieee.org
We present a fault-tolerant protocol for task-parallel message-passing applications to
mitigate transient errors. The protocol requires the restart only of the task that experienced …

Checkpointing and localized recovery for nested fork-join programs

C Fohry - arXiv preprint arXiv:2102.12941, 2021 - arxiv.org
While checkpointing is typically combined with a restart of the whole application, localized
recovery permits all but the affected processes to continue. In task-based cluster …

Marriage between coordinated and uncoordinated checkpointing for the exascale era

O Subasi, F Zyulkyarov, O Unsal… - 2015 IEEE 17th …, 2015 - ieeexplore.ieee.org
The state-of-the-art checkpointing techniques are projected to be prohibitively expensive in
the Exascale era. These techniques are most often holistic in nature which prevents them to …

A runtime heuristic to selectively replicate tasks for application-specific reliability targets

O Subasi, G Yalcin, F Zyulkyarov… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
In this paper we propose a runtime-based selective task replication technique for task-
parallel high performance computing applications. Our selective task replication technique is …

FPGA architecture characterization for system level performance analysis

D Densmore, A Donlin… - Proceedings of the …, 2006 - ieeexplore.ieee.org
We present a modular and scalable approach for automatically extracting actual
performance information from a set of FPGA-based architecture topologies. This information …

Asynchronous and exact forward recovery for detected errors in iterative solvers

L Jaulmes, M Moreto, E Ayguade… - … on Parallel and …, 2018 - ieeexplore.ieee.org
Current trends and projections show that faults in computer systems become increasingly
common. Such errors may be detected, and possibly corrected transparently, e.g., by Error …

CRC-based memory reliability for task-parallel HPC applications

O Subasi, O Unsal, J Labarta, G Yalcin… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
Memory reliability will be one of the major concerns for future HPC and Exascale systems.
This concern is mostly attributed to the expected massive increase in memory capacity and …

Quantifying the impact of data replication on error propagation

Z Ozturk, HR Topcuoglu, MT Kandemir - Cluster Computing, 2023 - Springer
Various technological developments in the microprocessor world make modern computing
systems more vulnerable to soft errors than in the past, and consequently fault tolerance …

Towards distributed software resilience in asynchronous many-task programming models

N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Implementing software resiliency in HPX for extreme scale computing

N Gupta, JR Mayo, AS Lemoine, H Kaiser - arXiv preprint arXiv …, 2020 - arxiv.org
Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …