BlueGene/L failure analysis and prediction models

A survey of online failure prediction methods

F Salfner, M Lenk, M Malek - ACM Computing Surveys (CSUR), 2010 - dl.acm.org

With the ever-growing complexity and dynamicity of computer systems, proactive fault
management is an effective approach to enhancing availability. Online failure prediction is …

Cited by 810 - Related articles - All 11 versions

Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities

F Cappello - The International Journal of High Performance …, 2009 - journals.sagepub.com

The emergence of petascale systems and the promise of future exascale systems have
reinvigorated the community interest in how to manage failures in such systems and ensure …

Cited by 304 - Related articles - All 4 versions

[PDF] drj.com

Why does the cloud stop computing? Lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org

We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Cited by 277 - Related articles - All 6 versions

[PDF] cv-foundation.org

Informed Haar-like features improve pedestrian detection

S Zhang, C Bauckhage… - Proceedings of the IEEE …, 2014 - cv-foundation.org

We propose a simple yet effective detector for pedestrian detection. The basic idea is to
incorporate common sense and everyday knowledge into the design of simple and …

Cited by 406 - Related articles - All 10 versions

[PDF] psu.edu

What supercomputers say: A study of five system logs

A Oliner, J Stearley - 37th annual IEEE/IFIP international …, 2007 - ieeexplore.ieee.org

If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …

Cited by 674 - Related articles - All 8 versions

[PDF] springer.com

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

IP Egwutuoha, D Levy, B Selic, S Chen - The Journal of Supercomputing, 2013 - Springer

In recent years, High Performance Computing (HPC) systems have been shifting
from expensive massively parallel architectures to clusters of commodity PCs to take …

Cited by 338 - Related articles - All 12 versions

[PDF] acm.org

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Cited by 172 - Related articles - All 12 versions

[PDF] archive.org

Lessons learned from the analysis of system failures at petascale: The case of Blue Waters

C Di Martino, Z Kalbarczyk, RK Iyer… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org

This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …

Cited by 272 - Related articles - All 5 versions

[PDF] illinois.edu

Toward exascale resilience

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com

Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Cited by 482 - Related articles - All 14 versions

[PDF] researchgate.net

Failure prediction in IBM BlueGene/L event logs

Y Liang, Y Zhang, H Xiong… - … Conference on Data …, 2007 - ieeexplore.ieee.org

Frequent failures are becoming a serious concern to the community of high-end computing,
especially when the applications and the underlying systems rapidly grow in size and …

Cited by 332 - Related articles - All 14 versions
