Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org

Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Cited by 31 · Related articles · All 12 versions

[PDF] gruss.cc

CSI: Rowhammer–Cryptographic security and integrity against rowhammer

J Juffinger, L Lamster, A Kogler… - … IEEE Symposium on …, 2023 - ieeexplore.ieee.org

In this paper, we present CSI: Rowhammer, a principled hardware-software co-design
Rowhammer mitigation with cryptographic security and integrity guarantees, that does not …

Cited by 43 · Related articles · All 10 versions

[PDF] ieee.org

FT-CNN: Algorithm-based fault tolerance for convolutional neural networks

K Zhao, S Di, S Li, X Liang, Y Zhai… - … on Parallel and …, 2020 - ieeexplore.ieee.org

Convolutional neural networks (CNNs) are becoming more and more important for solving
challenging and critical problems in many fields. CNN inference applications have been …

Cited by 113 · Related articles · All 9 versions

[PDF] tsinghua.edu.cn

What can we learn from four years of data center hardware failures?

G Wang, L Zhang, W Xu - 2017 47th Annual IEEE/IFIP …, 2017 - ieeexplore.ieee.org

Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …

Cited by 144 · Related articles · All 9 versions

[PDF] acm.org

Desh: deep learning for system health prediction of lead times to failure in HPC

A Das, F Mueller, C Siegel, A Vishnu - Proceedings of the 27th …, 2018 - dl.acm.org

Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …

Cited by 107 · Related articles · All 4 versions

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier

The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

[PDF] arxiv.org

HARP: Practically and effectively identifying uncorrectable errors in memory chips that use on-die error-correcting codes

M Patel, GF de Oliveira, O Mutlu - MICRO-54: 54th Annual IEEE/ACM …, 2021 - dl.acm.org

Aggressive storage density scaling in modern main memories causes increasing error rates
that are addressed using error-mitigation techniques. State-of-the-art techniques for …

Cited by 25 · Related articles · All 7 versions

[PDF] nsf.gov

Silent data errors: Sources, detection, and modeling

A Singh, S Chakravarty, G Papadimitriou… - 2023 IEEE 41st VLSI …, 2023 - ieeexplore.ieee.org

Chip manufacturers and hyperscalers are becoming increasingly aware of the problem
posed by Silent Data Errors (SDE) and are taking steps to address it. Major computing …

Cited by 11 · Related articles · All 4 versions

Predicting faults in high performance computing systems: An in-depth survey of the state-of-the-practice

D Jauk, D Yang, M Schulz - … of the International Conference for High …, 2019 - dl.acm.org

As we near exascale, resilience remains a major technical hurdle. Any technique with the
goal of achieving resilience suffers from having to be reactive, as failures can appear at any …

Cited by 38 · Related articles

[PDF] uoa.gr

Silent data corruptions: The stealthy saboteurs of digital integrity

G Papadimitriou, D Gizopoulos… - 2023 IEEE 29th …, 2023 - ieeexplore.ieee.org

Silent Data Corruptions (SDCs) pose a significant threat to the integrity of digital systems.
These stealthy saboteurs silently corrupt data, remaining undetected by traditional error …

Cited by 8 · Related articles · All 2 versions
