With the rate of errors that silently affect an application's state/output expected to increase in future HPC machines, numerous mitigation schemes have been proposed, but little work …
A Das, F Mueller, B Rountree - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org
Production HPC clusters endure failures reducing computational capability and resource availability. Despite the presence of various failure prediction schemes for large-scale …
S Hukerikar, PC Diniz, RF Lucas - European Conference on Parallel …, 2013 - Springer
Redundancy both in space and time has been widely used to detect and in some cases correct errors in High Performance Computing (HPC) systems. With the HPC community …
V Kamath, R Giri, R Muralidhar - 2013 IEEE Sixth International …, 2013 - ieeexplore.ieee.org
Silicon Design and Electronic Design Automation (EDA) business is highly competitive and time to market is of utmost importance in the semiconductor industry where companies put in …
The era of petascale computing brought machines with hundreds of thousands of processors. The next generation of exascale supercomputers will make available clusters …
The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational …
In this paper, we present a resilience analysis of the impact of soft errors on CLAMR, a hydrodynamics mini-app for high performance computing (HPC). We utilize F-SEFI, a fine …
MapReduce has become a popular programming paradigm in the domain of batch processing systems. Its simplicity allows applications to be highly scalable and to be easily …
S Yang, K Wu, Y Qiao, D Li… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile memories (NVM) provides a solution to build fault tolerant HPC. Data in NVM-based main …