Understanding failures in petascale computers

B Schroeder, GA Gibson - Journal of Physics: Conference Series, 2007 - iopscience.iop.org
With petascale computers only a year or two away there is a pressing need to anticipate and
compensate for a probable increase in failure and application interruption rates …

[BOOK] Computer architecture: a quantitative approach

JL Hennessy, DA Patterson - 2017 - books.google.com
Computer Architecture: A Quantitative Approach, Sixth Edition has been considered
essential reading by instructors, students and practitioners of computer design for over 20 …

A large-scale study of failures in high-performance computing systems

B Schroeder, GA Gibson - IEEE transactions on Dependable …, 2009 - ieeexplore.ieee.org
Designing highly dependable systems requires a good understanding of failure
characteristics. Unfortunately, little raw data on failures in large IT installations are publicly …

Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

B Schroeder, GA Gibson - ACM Transactions on Storage (TOS), 2007 - dl.acm.org
Component failure in large-scale IT installations is becoming an ever-larger problem as the
number of components in a single cluster approaches a million. This article is an extension …

Why do Internet services fail, and what can be done about it?

D Oppenheimer, A Ganapathi… - 4th Usenix Symposium on …, 2003 - usenix.org
In 1986 Jim Gray published his landmark study of the causes of failures of Tandem systems
and the techniques Tandem used to prevent such failures. See J. Gray. Why do computers …

X-ray: Automating root-cause diagnosis of performance anomalies in production software

M Attariyan, M Chow, J Flinn - 10th USENIX Symposium on Operating …, 2012 - usenix.org
Troubleshooting the performance of production software is challenging. Most existing tools,
such as profiling, tracing, and logging systems, reveal what events occurred during …

An empirical study on configuration errors in commercial and open source systems

Z Yin, X Ma, J Zheng, Y Zhou… - Proceedings of the …, 2011 - dl.acm.org
Configuration errors (i.e., misconfigurations) are among the dominant causes of system
failures. Their importance has inspired many research efforts on detecting, diagnosing, and …

[PDF] Microreboot--a technique for cheap recovery

G Candea, S Kawamoto, Y Fujiki, G Friedman… - arXiv preprint cs …, 2004 - usenix.org
A significant fraction of software failures in large-scale Internet systems are cured by
rebooting, even when the exact failure causes are unknown. However, rebooting can be …

[PDF] Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies

D Patterson, A Brown, P Broadwell, G Candea, M Chen… - 2002 - Citeseer
It is time to broaden our performance-dominated research agenda. A four order of
magnitude increase in performance since the first ASPLOS in 1982 means that few outside …

[PDF] Automating configuration troubleshooting with dynamic information flow analysis

M Attariyan, J Flinn - 9th USENIX Symposium on Operating Systems …, 2010 - usenix.org
Software misconfigurations are time-consuming and enormously frustrating to troubleshoot.
In this paper, we show that dynamic information flow analysis helps solve these problems by …