Maintaining leadership in HPC requires the ability to support simulations at large scales and fidelity. In this study, we detail one of the most significant productivity challenges in …
Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, much remains to be determined regarding how faults …
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop …
J Nieplocha, A Márquez, J Feo… - Proceedings of the 4th …, 2007 - dl.acm.org
The resurgence of current and upcoming multithreaded architectures and programming models led us to conduct a detailed study to understand the potential of these platforms to …
E Saillard, M Sergent, CTA Kaci… - 2022 IEEE/ACM Sixth …, 2022 - ieeexplore.ieee.org
Communications are a critical part of HPC simulations, and one of the main focuses of application developers when scaling on supercomputers. While classical message passing …
M Burtscher, BD Kim, J Diamond… - SC'10: Proceedings …, 2010 - ieeexplore.ieee.org
HPC systems are notorious for operating at a small fraction of their peak performance, and the ongoing migration to multi-core and multi-socket compute nodes further complicates …
Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft-errors, or bit-flips, are expected to greatly increase due to the …
SKS Hari, SV Adve, H Naeimi - IEEE/IFIP international …, 2012 - ieeexplore.ieee.org
With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very …
As high-performance computing systems scale in size and computational power, the danger of silent errors, i.e., errors that can bypass hardware detection mechanisms and impact …