FTI: High performance fault tolerance interface for hybrid systems

L Bautista-Gomez, S Tsuboi, D Komatitsch… - Proceedings of 2011 …, 2011 - dl.acm.org
Large scientific applications deployed on current petascale systems expend a significant
amount of their execution time dumping checkpoint files to remote storage. New fault tolerant …

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers

N Maruyama, T Nomura, K Sato… - Proceedings of 2011 …, 2011 - dl.acm.org
This paper proposes a compiler-based programming framework that automatically translates
user-written structured grid code into scalable parallel implementation code for GPU …

Efficient verification of real-time systems: Compact data structure and state-space reduction

KG Larsen, F Larsson, P Pettersson… - Proceedings Real-Time …, 1997 - ieeexplore.ieee.org
During the past few years, a number of verification tools have been developed for real-time
systems in the framework of timed automata (e.g., KRONOS and UPPAAL). One of the major …

Optimization of multi-level checkpoint model for large scale HPC applications

S Di, MS Bouguerra, L Bautista-Gomez… - 2014 IEEE 28th …, 2014 - ieeexplore.ieee.org
HPC community projects that future extreme scale systems will be much less stable than
current Petascale systems, thus requiring sophisticated fault tolerance to guarantee the …

Checkpointing strategies for parallel jobs

M Bougeret, H Casanova, M Rabie, Y Robert… - Proceedings of 2011 …, 2011 - dl.acm.org
This work provides an analysis of checkpointing strategies for minimizing expected job
execution times in an environment that is subject to processor failures. In the case of both …

Toward an optimal online checkpoint solution under a two-level HPC checkpoint model

S Di, Y Robert, F Vivien… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
The traditional single-level checkpointing method suffers from significant overhead on
large-scale platforms. Hence, multilevel checkpointing protocols have been studied extensively in …

CRUM: Checkpoint-restart support for CUDA's unified memory

R Garg, A Mohan, M Sullivan… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal
GPU. The older CUDA programming style is akin to older large-memory UNIX applications …

Improving the computing efficiency of HPC systems using a combination of proactive and preventive checkpointing

MS Bouguerra, A Gainaru, LB Gomez… - 2013 IEEE 27th …, 2013 - ieeexplore.ieee.org
As the failure frequency is increasing with the components count in modern and future
supercomputers, resilience is becoming critical for extreme scale systems. The association …

CRAC: Checkpoint-restart architecture for CUDA with streams and UVM

T Jain, G Cooperman - SC20: International Conference for High …, 2020 - ieeexplore.ieee.org
The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues
to grow. While fault tolerance is a critical issue for supercomputing, there does not currently …

Asymmetric resilience: Exploiting task-level idempotency for transient error recovery in accelerator-based systems

J Leng, A Buyuktosunoglu, R Bertran… - … Symposium on High …, 2020 - ieeexplore.ieee.org
Accelerators make the task of building systems that are resilient against transient errors like
voltage noise and soft errors hard. Architects integrate accelerators into the system as black …