Authors
Ramendra K Sahoo, Mark S Squillante, Anand Sivasubramaniam, Yanyong Zhang
Publication date
2004/6/28
Conference
International Conference on Dependable Systems and Networks, 2004
Pages
772-781
Publisher
IEEE
Description
The growing complexity of hardware and software mandates the recognition of fault occurrence in system deployment and management. While there are several techniques to prevent and/or handle faults, there continues to be a growing need for an in-depth understanding of system errors and failures and their empirical and statistical properties. This understanding can help evaluate the effectiveness of different techniques for improving system availability, in addition to developing new solutions. In this paper, we analyze the empirical and statistical properties of system errors and failures from a network of nearly 400 heterogeneous servers running a diverse workload over a year. While improvements in system robustness continue to limit the number of actual failures to a very small fraction of the recorded errors, the failure rates are significant and highly variable. Our results also show that the system error and …
Total citations
[Chart: citations per year, 2003-2024]
Scholar articles
RK Sahoo, MS Squillante, A Sivasubramaniam… - International Conference on Dependable Systems and …, 2004