A large-scale study of failures in high-performance computing systems

Authors

Bianca Schroeder, Garth A Gibson

Publication date

2009/2/6

Journal

IEEE Transactions on Dependable and Secure Computing

Volume

Issue

Pages

337-350

Publisher

IEEE

Description

Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations are publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems …

Total citations

Cited by 1573

Citations per year: 2008: 60, 2009: 64, 2010: 96, 2011: 108, 2012: 99, 2013: 106, 2014: 135, 2015: 111, 2016: 133, 2017: 134, 2018: 103, 2019: 79, 2020: 81, 2021: 56, 2022: 47, 2023: 45, 2024: 18

Scholar articles

A large-scale study of failures in high-performance computing systems

B Schroeder, GA Gibson - IEEE transactions on Dependable and Secure …, 2009

Cited by 1573. All 29 versions.