Designing and modelling selective replication for fault-tolerant HPC applications

O Subasi, G Yalcin, F Zyulkyarov… - 2017 17th IEEE/ACM …, 2017 - ieeexplore.ieee.org
… selective replication technique for HPC applications for both fail-stop errors and SDCs. Since
complete replication of applications … a runtime-based technique for selective replication. …

A runtime heuristic to selectively replicate tasks for application-specific reliability targets

O Subasi, G Yalcin, F Zyulkyarov… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
… In this research, our key findings were: First, complete replication of HPC applications is
not required to mitigate the foreseen exascale error rates while achieving the same reliability …

Exploring partial replication to improve lightweight silent data corruption detection for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Euro-Par 2016: Parallel …, 2016 - Springer
… challenge for high-performance computing (HPC) applications as … able to detect SDC in HPC
applications to a certain level by … far from fully protecting applications to a level comparable …

Automatic risk-based selective redundancy for fault-tolerant task-parallel HPC applications

O Subasi, O Unsal, S Krishnamoorthy - Proceedings of the Third …, 2017 - dl.acm.org
… applications to mitigate silent and fail-stop errors. To avoid the prohibitive costs of complete
replication, we introduce a lightweight selective replication … and selectively replicate only the …

Processor-level selective replication

N Nakka, K Pattabiraman, R Iyer - 37th Annual IEEE/IFIP …, 2007 - ieeexplore.ieee.org
… In this section, we show how the properties of the application are leveraged by selective
replication in order to identify what to replicate in the application. This analysis consists of three …

Selective Process Replication for Fault Tolerance in Large-scale, Heterogeneous Environments with Non-Uniform Node Failure Distribution

L Li, T Znati, R Melhem - personales.upv.es
… not been widely adopted in HPC environments, due to its dependency on applications [15],
[… for HPC environment [24]. In this work, we study a practical selective replication strategy for …

Partial redundancy in HPC systems with non-uniform node reliabilities

Z Hussain, T Znati, R Melhem - … High Performance Computing …, 2018 - ieeexplore.ieee.org
… selectively replicating tasks based on criticality [22][23][24]. These works replicate tasks from
an application … The idea of criticality is orthogonal to our task of selectively replicating nodes …

SmartApps: An application centric approach to high performance computing

L Rauchwerger, NM Amato, J Torrellas - International Workshop on …, 2000 - Springer
… needed for allocating replicated arrays across processors. CON… (rep), replicated buffer with
links (ll), selective privatization (sel)… Write method because iteration replication is very difficult …

CRC-based memory reliability for task-parallel HPC applications

O Subasi, O Unsal, J Labarta, G Yalcin… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
… replication vary from 0.1% (0.11% in total, Pingpong) to 7% (15% in total, FFT) showing that
replication … For more discussion and techniques on replication and selective replication, we …

Evaluating compiler IR-level selective instruction duplication with realistic hardware errors

CK Chang, G Li, M Erez - … on Fault Tolerance for HPC at …, 2019 - ieeexplore.ieee.org
… Since we focus on evaluating selective instruction duplication for application code, we do
not inject instructions within libraries (e.g., libc and libm). Note that checking instructions are …