Unified fault-tolerance framework for hybrid task-parallel message-passing applications

O Subasi, T Martsinkevich… - … Journal of High …, 2018 - journals.sagepub.com
We present a unified fault-tolerance framework for task-parallel message-passing
applications to mitigate transient errors. First, we propose a fault-tolerant message-logging …

Toward a general theory of optimal checkpoint placement

O Subasi, G Kestor… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
Checkpoint/restart has been widely used to cope with fail-stop errors. The checkpointing
frequency is most often optimized by assuming an exponential failure distribution. However …

Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

FPGA architecture characterization for system level performance analysis

D Densmore, A Donlin… - Proceedings of the …, 2006 - ieeexplore.ieee.org
We present a modular and scalable approach for automatically extracting actual
performance information from a set of FPGA-based architecture topologies. This information …

Incorporating fault-tolerance awareness into system-level modeling and simulation

T Johnson, H Lam - 2021 IEEE/ACM 11th Workshop on Fault …, 2021 - ieeexplore.ieee.org
As the design space for high-performance computer (HPC) systems grows larger and more
complex, modeling and simulation (MODSIM) techniques become more important to better …

Analysis of parallel application checkpoint storage for system configuration

B León, D Franco, D Rexachs, E Luque - The Journal of Supercomputing, 2021 - Springer
The use of fault tolerance strategies such as checkpoints is essential to maintain the
availability of systems and their applications in high-performance computing environments …

Enhancing Fault Tolerance in High-Performance Computing: A real hardware case study on a RISC-V Vector Processing Unit

M Barbirotta, F Minervini, CR Morales, A Cristal… - Authorea …, 2024 - techrxiv.org
High-Performance Computing (HPC) systems are designed for large-scale processing and
complex dataset analysis leveraging scalability, efficiency, and parallelism, often integrating …

On the theory of speculative checkpointing: time and energy considerations

O Subasi, S Krishnamoorthy - Proceedings of the 15th ACM International …, 2018 - dl.acm.org
Collective checkpoint/rollback is the most popular approach for dealing with fail-stop errors
on high-performance computing platforms. Prior work has focused on choosing checkpoint …

Gestión del Almacenamiento para Tolerancia a Fallos en Computación de Altas Prestaciones [Storage Management for Fault Tolerance in High-Performance Computing]

B León Otero - 2023 - ddd.uab.cat
In HPC environments, it is essential to keep long-running applications operating
continuously. Redundancy is one of the methods used in …