Checkpointing workflows for fail-stop errors

L Han, LC Canon, H Casanova… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
We consider the problem of orchestrating the execution of workflow applications structured
as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail …

LADR: Low-cost application-level detector for reducing silent output corruptions

C Chen, G Eisenhauer, M Wolf, S Pande - Proceedings of the 27th …, 2018 - dl.acm.org
Applications running on future high performance computing (HPC) systems are more likely
to experience transient faults due to technology scaling trends with respect to higher circuit …

Optimal resilience patterns to cope with fail-stop and silent errors

A Benoit, A Cavelan, Y Robert… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
This work focuses on resilience techniques at extreme scale. Many papers deal with fail-stop
errors. Many others deal with silent errors (or silent data corruptions). But very few papers …

Spatial support vector regression to detect silent errors in the exascale era

O Subasi, S Di, L Bautista-Gomez… - 2016 16th IEEE/ACM …, 2016 - ieeexplore.ieee.org
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Exploiting spatial smoothness in HPC applications to detect silent data corruption

L Bautista-Gomez, F Cappello - 2015 IEEE 17th International …, 2015 - ieeexplore.ieee.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. This situation is pushing …

Exploring the capabilities of support vector machines in detecting silent data corruptions

O Subasi, S Di, L Bautista-Gomez… - … Informatics and Systems, 2018 - Elsevier
As the exascale era approaches, the increasing capacity of high-performance computing
(HPC) systems with targeted power and energy budget goals introduces significant …

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

A Benoit, A Cavelan, F Cappello, P Raghavan… - Journal of Parallel and …, 2018 - Elsevier
This paper provides a model and an analytical study of replication as a technique to cope
with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale …

A generic approach to scheduling and checkpointing workflows

L Han, V Le Fèvre, LC Canon, Y Robert… - Proceedings of the 47th …, 2018 - dl.acm.org
This work deals with scheduling and checkpointing strategies to execute scientific workflows
on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to …

A visual comparison of silent error propagation

Z Li, H Menon, K Mohror, S Liu, L Guo… - … on Visualization and …, 2022 - ieeexplore.ieee.org
High-performance computing (HPC) systems play a critical role in facilitating scientific
discoveries. Their scale and complexity (e.g., the number of computational units and software …

Anomaly detection in scientific datasets using sparse representation

A Moon, M Kim, J Chen, SW Son - Proceedings of the First Workshop on …, 2023 - dl.acm.org
As the size and complexity of high-performance computing (HPC) systems keep growing,
scientists' ability to trust the data produced is paramount due to potential data corruption for …