A case study of application structure aware resilience through differentiated state saving and recovery

A Dubey, H Fujita, Z Rubenstein… - Euro-Par 2015: Parallel …, 2015 - Springer
Resilience is a growing concern for large-scale simulations. As failures become more
frequent, alternatives to global checkpointing that limit the extent of needed recovery …

A performance debugger for a language supporting data distribution primitives

G Hurteau, A Singh, M Hancu… - … Processing Vol. 2, 1994 - ieeexplore.ieee.org
Data parallel languages based on user-supplied data distribution directives significantly
simplify the development of the initial version of a parallel application. However, selection of …

Improving data integrity in Linux software RAID with protection information (T10-PI)

B Zhang, RR Chandrasekar… - 2018 18th IEEE/ACM …, 2018 - ieeexplore.ieee.org
The T10 DIF (Data Integrity Field) and DIX (Data Integrity Extension) specifications provide
mechanisms to guarantee end-to-end data integrity and protection in the face of silent data …

Development of an Algorithm for Detection and Recovery of Corruption in Convolutional Neural Networks Data Storage

M Ramzanpour - 2021 - search.proquest.com
Computer vision based applications are commonly utilized in embedded systems. The
demand for higher accuracy leads to increased complexity of convolutional neural networks …

Towards Resilience Methods for Simulation Applications based on Actor Replication

M Schnaus - 2021 - mediatum.ub.tum.de
High-performance computing is an important field of scientific computing with many
problems offering the possibility of achieving speedups through high levels of …

Addressing Fault Tolerance for Staging Based Scientific Workflows

S Duan - 2020 - search.proquest.com
In-situ scientific workflows, i.e., executing the entire application workflows on the HPC system,
have emerged as an attractive approach to address data-related challenges by moving …

A TDMA based scheduling scheme in 802.11b WLANs with access point

O Abu-Sharkh, AH Tewfik - … and Computing (ITCC'05)-Volume II, 2005 - ieeexplore.ieee.org
This paper introduces a new scheduling scheme that provides fair access to all stations in
802.11b WLANs. The scheme divides the transmission opportunities between wireless …

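SEDAR: Detección y recuperación automática de fallos transitorios en sistemas de cómputo de altas prestaciones [SEDAR: Automatic detection and recovery of transient faults in high-performance computing systems]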

DM Montezanti - 2020 - sedici.unlp.edu.ar
Fault handling is a growing concern in the context of HPC; in the future, a greater variety
and higher rates of errors, longer detection intervals, and faults … are expected

F_Radish: Enhancing Silent Data Corruption Detection for Aerospace-Based Computing. Electronics 2021, 10, 61

N Yang, Y Wang - 2020 - search.proquest.com
Radiation-induced soft errors degrade the reliability of aerospace-based computing. Silent
data corruption (SDC) is the most dangerous and insidious type of soft error result. To detect …

On the Detection of Silent Data Corruptions in HPC Applications Using Redundant Multi-threading

D Pérez, T Ropars, E Meneses - European Conference on Parallel …, 2020 - Springer
This paper studies the use of Redundant Multi-Threading (RMT) to detect Silent Data
Corruptions in HPC applications. To understand if it can be a viable solution in an HPC …