Abstract: Asynchronous Many-Task (AMT) runtimes have recently been proposed as a promising software foundation for managing the increasing complexity of node architectures …
Checkpointing is the most widely used technique in high-performance computing (HPC) to ensure application progress in the presence of failures. In this paper, we present …
MC Lee, JC Lin, O Owe - 2018 IEEE 32nd International …, 2018 - ieeexplore.ieee.org
Today the Internet offers a massive amount of reviews and user experiences about a variety of products from different manufacturers, ranging from smartphones, automobiles, and home …
Undetected soft errors caused by transient bit flips can lead to silent data corruption (SDC), an undesirable outcome where invalid results pass for valid ones. This has motivated the …
This thesis focuses on a major problem for the HPC community: resilience. Computing platforms are growing ever larger in order to reach what we call exascale, i.e., a computing …