CRC-based memory reliability for task-parallel HPC applications

O Subasi, O Unsal, J Labarta, G Yalcin… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
Memory reliability will be one of the major concerns for future HPC and Exascale systems.
This concern is mostly attributed to the expected massive increase in memory capacity and …

ECC Parity: A technique for efficient memory error resilience for multi-channel memory systems

X Jian, R Kumar - SC'14: Proceedings of the International …, 2014 - ieeexplore.ieee.org
Servers and HPC systems often use a strong memory error correction code, or ECC, to meet
their reliability and availability requirements. However, these ECCs often require significant …

Havens: Explicit reliable memory regions for HPC applications

S Hukerikar, C Engelmann - 2016 IEEE High Performance …, 2016 - ieeexplore.ieee.org
Supporting error resilience in future exascale-class supercomputing systems is a critical
challenge. Due to transistor scaling trends and increasing memory density, scientific …

FlipSphere: A software-based DRAM error detection and correction library for HPC

D Fiala, F Mueller, KB Ferreira - 2016 IEEE/ACM 20th …, 2016 - ieeexplore.ieee.org
Proposed exascale systems will present considerable challenges. In particular, DRAM soft-
errors, or bit-flips, are expected to greatly increase due to higher memory densities and near …

Bamboo ECC: Strong, safe, and flexible codes for reliable computer memory

J Kim, M Sullivan, M Erez - 2015 IEEE 21st International …, 2015 - ieeexplore.ieee.org
Growing computer system sizes and levels of integration have made memory reliability a
primary concern, necessitating strong memory error protection. As such, large-scale systems …

Building fast, dense, low-power caches using erasure-based inline multi-bit ECC

J Kim, H Yang, MP McCartney… - 2013 IEEE 19th …, 2013 - ieeexplore.ieee.org
The embedded memory hierarchy of microprocessors and systems-on-a-chip plays a critical
role in the overall system performance, area, power, resilience, and yield. However, as …

System-level hardware-based protection of memories against soft-errors

V Gherman, S Evain, M Cartron… - … , Automation & Test …, 2009 - ieeexplore.ieee.org
We present a hardware-based approach to improve the resilience of a computer system
against the errors occurred in the main memory with the help of error detecting and …

Improving application resilience to memory errors with lightweight compression

S Levy, KB Ferreira, PG Bridges - SC'16: Proceedings of the …, 2016 - ieeexplore.ieee.org
In next-generation extreme-scale systems, application performance will be limited by
memory performance characteristics. The first exascale system is projected to contain many …

Software-only based diverse redundancy for ASIL-D automotive applications on embedded HPC platforms

S Alcaide, L Kosmidis, C Hernandez… - 2020 IEEE International …, 2020 - ieeexplore.ieee.org
High-Performance Computing (HPC) platforms become a must in automotive systems to
enable autonomous driving. However, automotive platforms must avoid Common Cause …

Designing and modelling selective replication for fault-tolerant HPC applications

O Subasi, G Yalcin, F Zyulkyarov… - 2017 17th IEEE/ACM …, 2017 - ieeexplore.ieee.org
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for
High Performance Computing (HPC) applications. There are studies that address fail-stop …