HiMFP: Hierarchical intelligent memory failure prediction for cloud service reliability

Q Yu, W Zhang, P Notaro, S Haeri… - 2023 53rd Annual …, 2023 - ieeexplore.ieee.org
In large-scale datacenters, memory failure is one of the leading causes of server crashes,
and uncorrectable error (UCE) is the major fault type indicating defects of memory modules …

From correctable memory errors to uncorrectable memory errors: What error bits tell

C Li, Y Zhang, J Wang, H Chen, X Liu… - … Conference for High …, 2022 - ieeexplore.ieee.org
Uncorrectable memory errors are one of the major failure causes in datacenters. In this
paper, we present an empirical study correlating correctable errors (CEs) and uncorrectable …

Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field

R Wu, S Zhou, J Lu, Z Shen, Z Xu, J Shu… - 2024 USENIX Annual …, 2024 - usenix.org
High-bandwidth memory (HBM) is regarded as a promising technology for fundamentally
overcoming the memory wall. It stacks up multiple DRAM dies vertically to dramatically …

Fault-aware prediction-guided page offlining for uncorrectable memory error prevention

X Du, C Li, S Zhou, X Liu, X Xu… - 2021 IEEE 39th …, 2021 - ieeexplore.ieee.org
Uncorrectable memory errors are the major causes of hardware failures in datacenters
leading to server crashes. Page offlining is an error-prevention mechanism implemented in …

An optical transceiver reliability study based on SFP monitoring and OS-level metric data

P Notaro, Q Yu, S Haeri, J Cardoso… - 2023 IEEE/ACM 23rd …, 2023 - ieeexplore.ieee.org
The increasing demand for cloud computing drives the expansion in scale of datacenters
and their internal optical network, in a strive for increasing bandwidth, high reliability, and …

Review of Memory RAS for Data Centers

J Lee, MJ Kim, WS Kim, YS Kim - IEEE Access, 2023 - ieeexplore.ieee.org
Multi-bit error and downtime due to uncorrectable error (UE) in a dual in line memory
module (DIMM) have received great attention in data centers for its high repair or …

AI-based Proactive Failure Management in Large-scale Cloud Environments

P Notaro - 2024 - mediatum.ub.tum.de
Modern IT infrastructures are becoming increasingly large and complex, creating challenges
for O&M teams in managing and optimizing cloud services. AIOps supports O&M through the …

Exploring Error Bits for Memory Failure Prediction: An In-Depth Correlative Study

Q Yu, W Zhang, J Cardoso… - 2023 IEEE/ACM …, 2023 - ieeexplore.ieee.org
In large-scale datacenters, memory failure is a common cause of server crashes, with
uncorrectable errors (UEs) being a major indicator of Dual Inline Memory Module (DIMM) …

Investigating Memory Failure Prediction Across CPU Architectures

Q Yu, W Zhang, M Zhou, J Yu, Z Sheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large-scale datacenters often experience memory failures, where Uncorrectable Errors
(UEs) highlight critical malfunction in Dual Inline Memory Modules (DIMMs). Existing …

ML-driven risk estimation for memory failure in a data center environment with convolutional neural networks, self-supervised data labeling and distribution-based …

T Breitenbach, SM Divakar, L Rasbach… - Journal of Parallel and …, 2024 - Elsevier
With the trend towards multi-socket server systems, the demand for random access memory
(RAM) per server increased. The consequence is more DIMM sockets per server. Since …