Filtering failure logs for a BlueGene/L prototype

A Oliner, J Stearley - 37th annual IEEE/IFIP international …, 2007 - ieeexplore.ieee.org

If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …

Cited by 674 · Related articles · All 8 versions

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Cited by 172 · Related articles · All 12 versions

Lessons learned from the analysis of system failures at petascale: The case of Blue Waters

C Di Martino, Z Kalbarczyk, RK Iyer… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org

This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …

Cited by 272 · Related articles · All 5 versions

What can we learn from four years of data center hardware failures?

G Wang, L Zhang, W Xu - 2017 47th Annual IEEE/IFIP …, 2017 - ieeexplore.ieee.org

Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …

Cited by 144 · Related articles · All 9 versions

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

D Tiwari, S Gupta, J Rogers, D Maxwell… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org

Increase in graphics hardware performance and improvements in programmability has
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …

Cited by 191 · Related articles · All 9 versions

BlueGene/L failure analysis and prediction models

Y Liang, Y Zhang, A Sivasubramaniam… - … and Networks (DSN' …, 2006 - ieeexplore.ieee.org

The growing computational and storage needs of several scientific applications mandate the
deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can …

Cited by 402 · Related articles · All 14 versions

Exploring event correlation for failure prediction in coalitions of clusters

S Fu, CZ Xu - Proceedings of the 2007 ACM/IEEE conference on …, 2007 - dl.acm.org

In large-scale networked computing systems, component failures become norms instead of
exceptions. Failure prediction is a crucial technique for self-managing resource burdens …

Cited by 243 · Related articles · All 12 versions

Log‐based anomaly detection for distributed systems: State of the art, industry experience, and open issues

X Wei, J Wang, C Sun, D Towey… - Journal of Software …, 2024 - Wiley Online Library

Distributed systems have been widely used in many safety‐critical areas. Any abnormalities
(e.g., service interruption or service quality degradation) could lead to application crashes or …

Cited by 1 · Related articles

A large-scale study of soft-errors on GPUs in the field

B Nie, D Tiwari, S Gupta, E Smirni… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org

Parallelism provided by the GPU architecture has enabled domain scientists to simulate
physical phenomena at a much faster rate and finer granularity than what was previously …

Cited by 103 · Related articles · All 7 versions

Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge Leadership Computing Facility

D Tiwari, S Gupta, G Gallarno, J Rogers… - Proceedings of the …, 2015 - dl.acm.org

The high computational capability of graphics processing units (GPUs) is enabling and
driving the scientific discovery process at large-scale. The world's second fastest …

Cited by 99 · Related articles · All 6 versions
