Authors
Charu Kalra, Fritz Previlon, Xiangyu Li, Norman Rubin, David Kaeli
Publication date
2018/11/11
Conference
International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'18)
Description
As Graphics Processing Units (GPUs) become more pervasive in High Performance Computing (HPC) and safety-critical domains, ensuring that GPU applications can be protected from data corruption grows in importance. Despite prior efforts to mitigate errors, we still lack a clear understanding of how resilient these applications are in the presence of transient faults. Due to the random nature of these faults, predicting whether they will alter program output is a challenging problem. In this paper, we build a framework named PRISM, which uses a systematic approach to predict failures in GPU programs. PRISM extracts micro-architecture-agnostic features to characterize program resiliency, which serve as predictors to drive our statistical model. PRISM enables us to predict failures in applications without running exhaustive fault injection campaigns, thereby reducing the error estimation effort. PRISM can also be …
Total citations
[Bar chart: citations per year, 2019–2023]
Scholar articles
C Kalra, F Previlon, X Li, N Rubin, D Kaeli - SC18: International Conference for High Performance …, 2018