Large scale debugging of parallel tasks with AutomaDeD

I Laguna, T Gamblin, BR de Supinski… - Proceedings of 2011 …, 2011 - dl.acm.org
Developing correct HPC applications continues to be a challenge as the number of cores
increases in today's largest systems. Most existing debugging techniques perform poorly at …

Report of the HPC Correctness Summit, Jan 25–26, 2017, Washington, DC

G Gopalakrishnan, PD Hovland, C Iancu… - arXiv preprint arXiv …, 2017 - arxiv.org
Maintaining leadership in HPC requires the ability to support simulations at large scales and
fidelity. In this study, we detail one of the most significant productivity challenges in …

Understanding the propagation of transient errors in HPC applications

RA Ashraf, R Gioiosa, G Kestor, RF DeMara… - Proceedings of the …, 2015 - dl.acm.org
Resiliency of exascale systems has quickly become an important concern for the scientific
community. Despite its importance, still much remains to be determined regarding how faults …

Designing and modelling selective replication for fault-tolerant HPC applications

O Subasi, G Yalcin, F Zyulkyarov… - 2017 17th IEEE/ACM …, 2017 - ieeexplore.ieee.org
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for
High Performance Computing (HPC) applications. There are studies that address fail-stop …

Evaluating the potential of multithreaded platforms for irregular scientific computations

J Nieplocha, A Márquez, J Feo… - Proceedings of the 4th …, 2007 - dl.acm.org
The resurgence of current and upcoming multithreaded architectures and programming
models led us to conduct a detailed study to understand the potential of these platforms to …

Static local concurrency errors detection in MPI-RMA programs

E Saillard, M Sergent, CTA Kaci… - 2022 IEEE/ACM Sixth …, 2022 - ieeexplore.ieee.org
Communications are a critical part of HPC simulations, and one of the main focuses of
application developers when scaling on supercomputers. While classical message passing …

PerfExpert: An easy-to-use performance diagnosis tool for HPC applications

M Burtscher, BD Kim, J Diamond… - SC'10: Proceedings …, 2010 - ieeexplore.ieee.org
HPC systems are notorious for operating at a small fraction of their peak performance, and
the ongoing migration to multi-core and multi-socket compute nodes further complicates …

A tunable, software-based DRAM error detection and correction library for HPC

D Fiala, KB Ferreira, F Mueller… - Euro-Par 2011: Parallel …, 2012 - Springer
Proposed exascale systems will present a number of considerable resiliency challenges. In
particular, DRAM soft-errors, or bit-flips, are expected to greatly increase due to the …

Low-cost program-level detectors for reducing silent data corruptions

SKS Hari, SV Adve, H Naeimi - IEEE/IFIP international …, 2012 - ieeexplore.ieee.org
With technology scaling, transient faults are becoming an increasing threat to hardware
reliability. Commodity systems must be made resilient to these in-field faults through very …

FlipTracker: Understanding natural error resilience in HPC applications

L Guo, D Li, I Laguna, M Schulz - … : International Conference for …, 2018 - ieeexplore.ieee.org
As high-performance computing systems scale in size and computational power, the danger
of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact …