Impact of voltage scaling on soft errors susceptibility of multicore server CPUs

D Agiakatsikas, G Papadimitriou, V Karakostas… - Proceedings of the 56th …, 2023 - dl.acm.org
Microprocessor power consumption and dependability are both crucial challenges that
designers have to cope with due to shrinking feature sizes and increasing transistor counts …

gem5-MARVEL: Microarchitecture-level resilience analysis of heterogeneous SoC architectures

O Chatzopoulos, G Papadimitriou… - … Symposium on High …, 2024 - ieeexplore.ieee.org
In this paper, we present gem5-MARVEL, the first consolidated microarchitecture-level fault
injection infrastructure for heterogeneous System-on-Chip architectures comprising CPUs of …

Harpocrates: Breaking the silence of CPU faults through hardware-in-the-loop program generation

N Karystinos, O Chatzopoulos… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Several hyperscalers have recently disclosed the occurrence of Silent Data Corruptions
(SDCs) in their systems fleets, sparking concerns about the severity of known and the …

Estimating the failures and silent errors rates of CPUs across ISAs and microarchitectures

D Gizopoulos, G Papadimitriou… - … IEEE International Test …, 2023 - ieeexplore.ieee.org
Silent data corruptions (SDCs) pose a significant challenge to the reliable operation of
modern microprocessors. As the need for enhanced performance and reliability continues to …

Silent Data Corruptions in Computing Systems: Early Predictions and Large-Scale Measurements

D Gizopoulos, G Papadimitriou… - 2024 IEEE European …, 2024 - ieeexplore.ieee.org
Silent Data Corruptions (SDCs) due to defects in computing chips (CPUs, GPUs, AI
accelerators) is a critical threat to the quality of large-scale computing in different application …

Silent Data Corruptions in Computing: Understand and Quantify

T Macieira, S Gurumurthy, S Gurumurthi… - 2024 IEEE 30th …, 2024 - ieeexplore.ieee.org
Ensuring the reliability of hardware components is essential, particularly in large-scale
installations that demand computing capabilities (cloud data centers, supercomputers, edge …

Soft Error Rate Measurements through ACE Analysis in TLB Structures of CPUs

KMI Sgouras - 2024 - pergamos.lib.uoa.gr
In recent years, there has been a decrease in the minimum feature size of the transistors in
integrated circuits. As a result, the vulnerability of CPU components has increased. An …

New Computer Evaluation Metrics for a Changing World

A Vahdat, X Ma, D Patterson - Communications of the ACM - dl.acm.org

Microarchitecturally Exploring Fault-Tolerance and Timing on Silicon on Chip

V Singh, G Khan, A Ojha - 2024 International Conference on …, 2024 - ieeexplore.ieee.org
This paper presents a micro architectural exploration of fault-tolerance and timing behavior
on silicon on chip. Silicon on chip represents a full-size mission in designing reliable …

GPU Reliability Assessment: Insights Across the Abstraction Layers

L Yang, G Papadimitriou, D Sartzetakis, A Jog, E Smirni… - lishanyang.github.io
Graphics Processing Units (GPUs) are widely deployed and utilized across various
computing domains including cloud and high-performance computing. Considering its …