gemV: A validated toolset for the early exploration of system reliability

K Tanikella, Y Koy, R Jeyapaul, K Lee… - 2016 IEEE 27th …, 2016 - ieeexplore.ieee.org
Decades of technology scaling have brought the threat of soft errors to modern embedded
processors. Though several methods have been proposed to protect systems from soft …

Measuring the impact of memory errors on application performance

M Gottscho, M Shoaib, S Govindan… - IEEE Computer …, 2016 - ieeexplore.ieee.org
Memory reliability is a key factor in the design of warehouse-scale computers. Prior work has
focused on the performance overheads of memory fault-tolerance schemes when errors do …

Soft error reliability predictor based on a Deep Feedforward Neural Network

DR Falcó, A Serrano-Cases… - 2020 IEEE Latin …, 2020 - ieeexplore.ieee.org
Statistical fault injection is a widely used methodology for early evaluation of the soft error
reliability of microprocessor based systems. Due to the increasing complexity of the software …

Characterizing the performance and scalability of many-core applications on virtualized platforms

X Song, H Chen, B Zang, X Song… - Parallel Processing …, 2010 - ipads.se.sjtu.edu.cn
Clouds have become attractive to applications because of their low cost and on-demand
computing model with the use of virtualization technologies. With the continual increasing …

Reliability Assessment of the Open-source Many-core Processor OpenPiton

C Dammak - 2022 - spectrum.library.concordia.ca
The fast-growing demand for computational capacity has led to the emergence of large-
scale systems where the parallel processing capabilities of many-core processors have …

On the injection of hardware faults in virtualized multicore systems

M Cinque, A Pecchia - Journal of Parallel and Distributed Computing, 2017 - Elsevier
Virtualized multicore systems represent an emerging computing paradigm in the critical
systems industry. Virtualization-based solutions leverage the different cores of the processor …

Lessons learned from memory errors observed over the lifetime of Cielo

S Levy, KB Ferreira, N DeBardeleben… - … Conference for High …, 2018 - ieeexplore.ieee.org
Maintaining the performance of high-performance computing (HPC) applications as failures
increase is a major challenge for next-generation extreme-scale systems. Recent work …

Seer: a lightweight online failure prediction approach

B Ozcelik, C Yilmaz - IEEE Transactions on Software …, 2015 - ieeexplore.ieee.org
Online failure prediction approaches aim to predict the manifestation of failures at runtime
before the failures actually occur. Existing approaches generally refrain from …

System-level effects of soft errors in uncore components

H Cho, E Cheng, T Shepherd… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
The effects of soft errors in processor cores have been widely studied. However, little has
been published about soft errors in uncore components, such as the memory subsystem and …

FluidCheck: A redundant threading-based approach for reliable execution in manycore processors

R Kalayappan, SR Sarangi - ACM Transactions on Architecture and …, 2015 - dl.acm.org
Soft errors have become a serious cause of concern with reducing feature sizes. The ability
to accommodate complex, Simultaneous Multithreading (SMT) cores on a single chip …