UnSync: A soft error resilient redundant multicore architecture

R Jeyapaul, F Hong, A Rhisheekesan… - 2011 International …, 2011 - ieeexplore.ieee.org
Reducing device dimensions, increasing transistor densities, and smaller timing windows
expose the vulnerability of processors to soft errors induced by charge carrying particles …

Understanding the propagation of error due to a silent data corruption in a sparse matrix vector multiply

J Calhoun, M Snir, L Olson… - 2015 IEEE International …, 2015 - ieeexplore.ieee.org
With the rate of errors that silently affect an application's state/output expected to increase in
future HPC machines, numerous mitigation schemes have been proposed, but little work …

Systemic assessment of node failures in HPC production platforms

A Das, F Mueller, B Rountree - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org
Production HPC clusters endure failures reducing computational capability and resource
availability. Despite the presence of various failure prediction schemes for large-scale …

A case for adaptive redundancy for HPC resilience

S Hukerikar, PC Diniz, RF Lucas - European Conference on Parallel …, 2013 - Springer
Redundancy both in space and time has been widely used to detect and in some cases
correct errors in High Performance Computing (HPC) systems. With the HPC community …

Experiences with a private enterprise cloud: Providing fault tolerance and high availability for interactive EDA applications

V Kamath, R Giri, R Muralidhar - 2013 IEEE Sixth International …, 2013 - ieeexplore.ieee.org
Silicon Design and Electronic Design Automation (EDA) business is highly competitive and
time to market is of utmost importance in the semiconductor industry where companies put in …

Evaluation of simple causal message logging for large-scale fault tolerant HPC systems

E Meneses, G Bronevetsky… - 2011 IEEE International …, 2011 - ieeexplore.ieee.org
The era of petascale computing brought machines with hundreds of thousands of
processors. The next generation of exascale supercomputers will make available clusters …

Using probabilistic characterization to reduce runtime faults in HPC systems

J Brandt, B Debusschere, A Gentile… - 2008 Eighth IEEE …, 2008 - ieeexplore.ieee.org
The current trend in high performance computing is to aggregate ever larger numbers of
processing and interconnection elements in order to achieve desired levels of computational …

Fault injection experiments with the CLAMR hydrodynamics mini-app

B Atkinson, N Debardeleben, Q Guan… - 2014 IEEE …, 2014 - ieeexplore.ieee.org
In this paper, we present a resilience analysis of the impact of soft errors on CLAMR, a
hydrodynamics mini-app for high performance computing (HPC). We utilize F-SEFI, a fine …

Active replication at (almost) no cost

A Martin, C Fetzer, A Brito - 2011 IEEE 30th International …, 2011 - ieeexplore.ieee.org
MapReduce has become a popular programming paradigm in the domain of batch
processing systems. Its simplicity allows applications to be highly scalable and to be easily …

Algorithm-directed crash consistence in non-volatile memory for HPC

S Yang, K Wu, Y Qiao, D Li… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile
memories (NVM) provides a solution to build fault tolerant HPC. Data in NVM-based main …