Authors
Omer Subasi, Osman Unsal, Jesus Labarta, Gulay Yalcin, Adrian Cristal
Publication date
2016/5/23
Conference
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Pages
1101-1112
Publisher
IEEE
Description
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and in the number of memory devices in Exascale systems. Error Correcting Codes (ECC) are the most commonly used protection mechanism for memory systems. However, state-of-the-art hardware ECCs will not provide sufficient error coverage for future computing systems, and stronger hardware ECCs that provide more coverage have prohibitive costs in terms of area, power, and latency. Software-based solutions are needed to cooperate with hardware. In this work, we propose a software mechanism based on Cyclic Redundancy Checks (CRCs) for task-parallel HPC applications. Our mechanism incurs only 1.7% performance overhead with hardware acceleration while remaining highly scalable. Our mathematical analysis demonstrates …
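The general idea of guarding task data with a CRC can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `checksum` and `run_task_checked` are hypothetical, and Python's standard `zlib.crc32` stands in for whatever CRC variant and hardware acceleration the paper uses. The sketch records a checksum when a task's input is produced and verifies it before the task consumes the data.

```python
import zlib


def checksum(buf: bytes) -> int:
    """Return the CRC-32 of buf as an unsigned 32-bit integer."""
    return zlib.crc32(buf) & 0xFFFFFFFF


def run_task_checked(task, data: bytes, expected_crc: int):
    """Hypothetical helper: verify the input's CRC, then run the task.

    Raises ValueError if the data no longer matches the checksum that
    was recorded when the data was produced.
    """
    if checksum(data) != expected_crc:
        raise ValueError("corruption detected in task input")
    return task(data)


# Record a checksum when the input is produced.
data = bytearray(b"task input block")
crc = checksum(bytes(data))

# Clean data passes the check and the task runs normally.
result = run_task_checked(len, bytes(data), crc)

# Simulate a single bit flip in memory; CRC-32 detects any single-bit error.
corrupted = bytearray(data)
corrupted[3] ^= 0x04
detected = checksum(bytes(corrupted)) != crc
```

In a real task-parallel runtime the check would presumably be attached to task inputs and outputs by the runtime itself rather than called by hand; that placement is an assumption here, not something the abstract states.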
Total citations
[Chart: citations per year, 2016 to 2021]