RUAD: Unsupervised anomaly detection in HPC systems

M Molan, A Borghesi, D Cesarini, L Benini… - Future Generation …, 2023 - Elsevier
The increasing complexity of modern high-performance computing (HPC) systems
necessitates the introduction of automated and data-driven methodologies to support system …

DRAM failure prediction in AIOps: Empirical evaluation, challenges and opportunities

Z Wu, H Xu, G Pang, F Yu, Y Wang, S Jian… - arXiv preprint arXiv …, 2021 - arxiv.org
DRAM failure prediction is a vital task in AIOps, which is crucial to maintain the reliability and
sustainable service of large-scale data centers. However, limited work has been done on …

Anomaly detection and anticipation in high performance computing systems

A Borghesi, M Molan, M Milano… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
In their quest toward Exascale, High Performance Computing (HPC) systems are rapidly
becoming larger and more complex, together with the issues concerning their maintenance …

HiMFP: Hierarchical intelligent memory failure prediction for cloud service reliability

Q Yu, W Zhang, P Notaro, S Haeri… - 2023 53rd Annual …, 2023 - ieeexplore.ieee.org
In large-scale datacenters, memory failure is one of the leading causes of server crashes,
and uncorrectable error (UCE) is the major fault type indicating defects of memory modules …

A case for transparent reliability in DRAM systems

M Patel, T Shahroodi, A Manglik, AG Yaglikci… - arXiv preprint arXiv …, 2022 - arxiv.org
Today's systems have diverse needs that are difficult to address using one-size-fits-all
commodity DRAM. Unfortunately, although system designers can theoretically adapt …

An in-depth correlative study between DRAM errors and server failures in production data centers

Z Cheng, S Han, PPC Lee, X Li… - 2022 41st International …, 2022 - ieeexplore.ieee.org
Dynamic Random Access Memory (DRAM) errors are prevalent and lead to server failures
in production data centers. However, little is known about the correlation between DRAM …

From correctable memory errors to uncorrectable memory errors: What error bits tell

C Li, Y Zhang, J Wang, H Chen, X Liu… - … Conference for High …, 2022 - ieeexplore.ieee.org
Uncorrectable memory errors are one of the major failure causes in datacenters. In this
paper, we present an empirical study correlating correctable errors (CEs) and uncorrectable …

Predicting DRAM-caused node unavailability in hyper-scale clouds

P Zhang, Y Wang, X Ma, Y Xu, B Yao… - 2022 52nd Annual …, 2022 - ieeexplore.ieee.org
DRAM faults are major hardware sources of cloud node unavailability. To enable early
preventive actions and mitigate DRAM fault impacts, prior studies focus on predicting DRAM …

Rethinking the producer-consumer relationship in modern DRAM-based systems

M Patel, T Shahroodi, A Manglik, AG Yağlıkçı… - IEEE …, 2024 - ieeexplore.ieee.org
Generational improvements to commodity DRAM throughout half a century have long
solidified its prevalence as main memory across the computing industry. However …

Reinforcement learning-based adaptive mitigation of uncorrected DRAM errors in the field

I Boixaderas, S Moré, J Bartolome, D Vicente… - Proceedings of the 33rd …, 2024 - dl.acm.org
Scaling to larger systems, with current levels of reliability, requires cost-effective methods to
mitigate hardware failures. One of the main causes of hardware failure is an uncorrected …