C Fohry - arXiv preprint arXiv:2102.12941, 2021 - arxiv.org
While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster …
The state-of-the-art checkpointing techniques are projected to be prohibitively expensive in the Exascale era. These techniques are most often holistic in nature which prevents them to …
In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is …
D Densmore, A Donlin… - Proceedings of the …, 2006 - ieeexplore.ieee.org
We present a modular and scalable approach for automatically extracting actual performance information from a set of FPGA-based architecture topologies. This information …
Current trends and projections show that faults in computer systems become increasingly common. Such errors may be detected, and possibly corrected transparently, e.g., by Error …
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and …
Various technological developments in the microprocessor world make modern computing systems more vulnerable to soft errors than in the past, and consequently fault tolerance …
N Gupta, JR Mayo, AS Lemoine… - 2020 IEEE/ACM 10th …, 2020 - ieeexplore.ieee.org
Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …
N Gupta, JR Mayo, AS Lemoine, H Kaiser - arXiv preprint arXiv …, 2020 - arxiv.org
Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware …