We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging …
O Subasi, G Kestor… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
Checkpoint/restart has been widely used to cope with fail-stop errors. The checkpointing frequency is most often optimized by assuming an exponential failure distribution. However …
As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant …
D Densmore, A Donlin… - Proceedings of the …, 2006 - ieeexplore.ieee.org
We present a modular and scalable approach for automatically extracting actual performance information from a set of FPGA-based architecture topologies. This information …
T Johnson, H Lam - 2021 IEEE/ACM 11th Workshop on Fault …, 2021 - ieeexplore.ieee.org
As the design space for high-performance computer (HPC) systems grows larger and more complex, modeling and simulation (MODSIM) techniques become more important to better …
The use of fault tolerance strategies such as checkpoints is essential to maintain the availability of systems and their applications in high-performance computing environments …
High-Performance Computing (HPC) systems are designed for large-scale processing and complex dataset analysis leveraging scalability, efficiency, and parallelism, often integrating …
Collective checkpoint/rollback is the most popular approach for dealing with fail-stop errors on high-performance computing platforms. Prior work has focused on choosing checkpoint …