A survey of online failure prediction methods

F Salfner, M Lenk, M Malek - ACM Computing Surveys (CSUR), 2010 - dl.acm.org
With the ever-growing complexity and dynamicity of computer systems, proactive fault
management is an effective approach to enhancing availability. Online failure prediction is …

Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities

F Cappello - The International Journal of High Performance …, 2009 - journals.sagepub.com
The emergence of petascale systems and the promise of future exascale systems have
reinvigorated the community interest in how to manage failures in such systems and ensure …

Why does the cloud stop computing? Lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Informed Haar-like features improve pedestrian detection

S Zhang, C Bauckhage… - Proceedings of the IEEE …, 2014 - cv-foundation.org
We propose a simple yet effective detector for pedestrian detection. The basic idea is to
incorporate common sense and everyday knowledge into the design of simple and …

What supercomputers say: A study of five system logs

A Oliner, J Stearley - 37th annual IEEE/IFIP international …, 2007 - ieeexplore.ieee.org
If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer
In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Lessons learned from the analysis of system failures at petascale: The case of Blue Waters

C Di Martino, Z Kalbarczyk, RK Iyer… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …

Toward exascale resilience

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Failure prediction in IBM BlueGene/L event logs

Y Liang, Y Zhang, H Xiong… - … Conference on Data …, 2007 - ieeexplore.ieee.org
Frequent failures are becoming a serious concern to the community of high-end computing,
especially when the applications and the underlying systems rapidly grow in size and …