Understanding silent data corruptions in a large production CPU population

S Wang, G Zhang, J Wei, Y Wang, J Wu… - Proceedings of the 29th …, 2023 - dl.acm.org
Silent Data Corruption (SDC) in processors can lead to various application-level issues,
such as incorrect calculations and even data loss. Since traditional techniques are not …

Harpocrates: Breaking the silence of CPU faults through hardware-in-the-loop program generation

N Karystinos, O Chatzopoulos… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Several hyperscalers have recently disclosed the occurrence of Silent Data Corruptions
(SDCs) in their systems fleets, sparking concerns about the severity of known and the …

Understanding Silent Data Corruption in Processors for Mitigating its Effects

S Wang, G Zhang, J Wei, Y Wang, J Wu… - ACM Transactions on …, 2024 - dl.acm.org
Silent Data Corruption (SDC) in processors can lead to various application-level issues,
such as incorrect calculations and even data loss. Since traditional techniques are not …

Deep Soft Error Propagation Modeling Using Graph Attention Network

J Ma, Z Duan, L Tang - Journal of Electronic Testing, 2022 - Springer
Soft errors are increasing in computer systems due to shrinking feature sizes. Soft errors can
induce incorrect outputs, also called silent data corruption (SDC), which raises no warnings …

F_Radish: Enhancing Silent Data Corruption Detection for Aerospace-Based Computing

N Yang, Y Wang - Electronics, 2020 - mdpi.com
Radiation-induced soft errors degrade the reliability of aerospace-based computing. Silent
data corruption (SDC) is the most dangerous and insidious type of soft error result. To detect …

Characterization of Program Behavior under Faulty Instruction Encoding

J Ma, Z Duan, L Tang - Scientific Programming, 2022 - Wiley Online Library
As process technology scales, electronic devices become more susceptible to soft errors.
Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability …

Predicting the soft error vulnerability of parallel applications using machine learning

I Öz, S Arslan - International Journal of Parallel Programming, 2021 - Springer
With the widespread use of the multicore systems having smaller transistor sizes, soft errors
become an important issue for parallel program execution. Fault injection is a prevalent …

A Checkpointing Recovery Approach for Soft Errors Based on Detector Locations

N Yang, Y Wang - Electronics, 2023 - mdpi.com
Soft errors are transient errors caused by single-event effects (SEEs) resulting from a strike
by high-energy particles acting on sensitive areas of integrated circuits. Soft errors frequently …

An Efficient Fault Tolerance Strategy for Multi-task MapReduce Models Using Coded Distributed Computing

Z Xie, J Zhang, Y Zhang, C Xu, P Chen, Z Qu… - … on Algorithms and …, 2023 - Springer
MapReduce is a programming framework designed for processing and analyzing large
volumes of data in a distributed computing environment. Despite its capabilities, it faces …

GATPS: An attention-based graph neural network for predicting SDC-causing instructions

J Ma, Z Duan, L Tang - 2021 IEEE 39th VLSI Test Symposium …, 2021 - ieeexplore.ieee.org
Soft errors can lead to silent data corruption (SDC), seriously compromising the reliability of
a system. To detect SDC, a profiling of SDC-causing instructions is usually needed to decide …