Nanocheckpoints: A task-based asynchronous dataflow framework for efficient and scalable...

T Martsinkevich, O Subasi, O Unsal… - 2015 IEEE …, 2015 - ieeexplore.ieee.org

We present a fault-tolerant protocol for task-parallel message-passing applications to
mitigate transient errors. The protocol requires the restart only of the task that experienced …

Cited by 24 · Related articles · All 7 versions

[PDF] arxiv.org

Checkpointing and localized recovery for nested fork-join programs

C Fohry - arXiv preprint arXiv:2102.12941, 2021 - arxiv.org

While checkpointing is typically combined with a restart of the whole application, localized
recovery permits all but the affected processes to continue. In task-based cluster …

Cited by 6 · Related articles · All 2 versions

Marriage between coordinated and uncoordinated checkpointing for the exascale era

O Subasi, F Zyulkyarov, O Unsal… - 2015 IEEE 17th …, 2015 - ieeexplore.ieee.org

The state-of-the-art checkpointing techniques are projected to be prohibitively expensive in
the Exascale era. These techniques are most often holistic in nature which prevents them to …

Cited by 12 · Related articles · All 4 versions

[PDF] agu.edu.tr

A runtime heuristic to selectively replicate tasks for application-specific reliability targets

O Subasi, G Yalcin, F Zyulkyarov… - 2016 IEEE …, 2016 - ieeexplore.ieee.org

In this paper we propose a runtime-based selective task replication technique for task-
parallel high performance computing applications. Our selective task replication technique is …

Cited by 12 · Related articles · All 9 versions

[PDF] uci.edu

FPGA architecture characterization for system level performance analysis

D Densmore, A Donlin… - Proceedings of the …, 2006 - ieeexplore.ieee.org

We present a modular and scalable approach for automatically extracting actual
performance information from a set of FPGA-based architecture topologies. This information …

Cited by 28 · Related articles · All 14 versions

[PDF] upc.edu

Asynchronous and exact forward recovery for detected errors in iterative solvers

L Jaulmes, M Moreto, E Ayguade… - … on Parallel and …, 2018 - ieeexplore.ieee.org

Current trends and projections show that faults in computer systems become increasingly
common. Such errors may be detected, and possibly corrected transparently, e.g., by Error …

Cited by 8 · Related articles · All 10 versions

CRC-based memory reliability for task-parallel HPC applications

O Subasi, O Unsal, J Labarta, G Yalcin… - 2016 IEEE …, 2016 - ieeexplore.ieee.org

Memory reliability will be one of the major concerns for future HPC and Exascale systems.
This concern is mostly attributed to the expected massive increase in memory capacity and …

Cited by 11 · Related articles · All 7 versions

Quantifying the impact of data replication on error propagation

Z Ozturk, HR Topcuoglu, MT Kandemir - Cluster Computing, 2023 - Springer

Various technological developments in the microprocessor world make modern computing
systems more vulnerable to soft errors than in the past, and consequently fault tolerance …

Towards distributed software resilience in asynchronous many-task programming models

N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org

Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Cited by 4 · Related articles · All 6 versions

[PDF] arxiv.org

Implementing software resiliency in HPX for extreme scale computing

N Gupta, JR Mayo, AS Lemoine, H Kaiser - arXiv preprint arXiv …, 2020 - arxiv.org

Exceptions and errors occurring within mission critical applications due to hardware failures
have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …

Cited by 4 · Related articles · All 3 versions
