Silent data errors: Sources, detection, and modeling

A Singh, S Chakravarty, G Papadimitriou… - 2023 IEEE 41st VLSI …, 2023 - ieeexplore.ieee.org
Chip manufacturers and hyperscalers are becoming increasingly aware of the problem
posed by Silent Data Errors (SDE) and are taking steps to address it. Major computing …

Silent data corruptions: The stealthy saboteurs of digital integrity

G Papadimitriou, D Gizopoulos… - 2023 IEEE 29th …, 2023 - ieeexplore.ieee.org
Silent Data Corruptions (SDCs) pose a significant threat to the integrity of digital systems.
These stealthy saboteurs silently corrupt data, remaining undetected by traditional error …

Estimating the failures and silent errors rates of CPUs across ISAs and microarchitectures

D Gizopoulos, G Papadimitriou… - … IEEE International Test …, 2023 - ieeexplore.ieee.org
Silent data corruptions (SDCs) pose a significant challenge to the reliable operation of
modern microprocessors. As the need for enhanced performance and reliability continues to …

Hard drives monitoring automation approach for Kubernetes container orchestration system

AS Shemyakinskaya, IV Nikiforov - Труды института системного …, 2020 - mathnet.ru
Today, a laborious and non-trivial task is to automate monitoring of hard drives in a cluster
infrastructure using the Kubernetes container management system. The paper discusses …

Failure prediction in datacenters using unsupervised multimodal anomaly detection

M Zhao, R Furuhata, M Agung… - … Conference on Big …, 2020 - ieeexplore.ieee.org
Predicting hard drive failures in datacenters can help avoid wasting resources and waiting
time for recovery. Anomaly detection from sensing data is commonly used for predicting …

Design and Evaluation of a Peripheral for Integrity Checking to Improve RAS in RISC-V Architectures

D Rossi, N Canino, S Di Matteo… - 2023 8th South-East …, 2023 - ieeexplore.ieee.org
This paper presents a peripheral to check for integrity against errors affecting memories in
RISC-V architectures. A HW-SW Interface for Error Logging and Reporting to improve …

HW-SW interface design and implementation for error logging and reporting for RAS improvement

N Canino, S Di Matteo, D Rossi, S Saponara - IEEE Access, 2024 - ieeexplore.ieee.org
When designing a resilient computing system, the desired degree of Reliability, Availability,
and Serviceability (RAS) must be assessed and guaranteed. This article presents a …

Silent Data Corruptions in Computing Systems: Early Predictions and Large-Scale Measurements

D Gizopoulos, G Papadimitriou… - 2024 IEEE European …, 2024 - ieeexplore.ieee.org
Silent Data Corruptions (SDCs) due to defects in computing chips (CPUs, GPUs, AI
accelerators) are a critical threat to the quality of large-scale computing in different application …

Challenges on Unveiling Voltage Margins from the Node to the Datacentre Level

G Papadimitriou, D Gizopoulos - … at the EDGE: New Challenges for …, 2022 - Springer
In this chapter, we present and discuss one of the most important aspects of technology
scaling: the improvement of power consumption of microprocessors. First, we present the …

Hard drives monitoring automation approach for Kubernetes container orchestration system

AS Shemyakinskaya, IV Nikiforov - Труды Института системного …, 2020 - cyberleninka.ru
Today, a laborious and non-trivial task is automating the monitoring of hard drives
in a cluster infrastructure when using …