Reading between the lines of failure logs: Understanding how HPC systems fail

S He, P He, Z Chen, T Yang, Y Su, MR Lyu - ACM computing surveys …, 2021 - dl.acm.org

Logs are semi-structured text generated by logging statements in software source code. In
recent decades, software logs have become imperative in the reliability assurance …

Cited by 194 - Related articles - All 8 versions

The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org

The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

Cited by 64 - Related articles - All 4 versions

Why does the cloud stop computing? Lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org

We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Cited by 277 - Related articles - All 6 versions

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Cited by 173 - Related articles - All 12 versions

What can we learn from four years of data center hardware failures?

G Wang, L Zhang, W Xu - 2017 47th Annual IEEE/IFIP …, 2017 - ieeexplore.ieee.org

Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …

Cited by 144 - Related articles - All 9 versions

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

D Tiwari, S Gupta, J Rogers, D Maxwell… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org

Increase in graphics hardware performance and improvements in programmability has
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …

Cited by 191 - Related articles - All 9 versions

An analysis of network-partitioning failures in cloud systems

A Alquraan, H Takruri, M Alfatafta… - 13th USENIX Symposium …, 2018 - usenix.org

We present a comprehensive study of 136 system failures attributed to network-partitioning
faults from 25 widely used distributed systems. We found that the majority of the failures led …

Cited by 92 - Related articles - All 15 versions

Failure analysis of jobs in compute clouds: A google cluster case study

X Chen, CD Lu, K Pattabiraman - 2014 IEEE 25th International …, 2014 - ieeexplore.ieee.org

In this paper, we analyze a workload trace from the Google cloud cluster and characterize
the observed failures. The goal of our work is to improve the understanding of failures in …

Cited by 122 - Related articles - All 7 versions

Failure analysis of virtual and physical machines: patterns, causes and characteristics

R Birke, I Giurgiu, LY Chen… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org

In today's commercial data centers, the computation density grows continuously as the
number of hardware components and workloads in units of virtual machines increase. The …

Cited by 123 - Related articles - All 6 versions

Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems

D Tiwari, S Gupta, SS Vazhkudai - 2014 44th Annual IEEE/IFIP …, 2014 - ieeexplore.ieee.org

Continuing increase in the computational power of supercomputers has enabled large-scale
scientific applications in the areas of astrophysics, fusion, climate and combustion to run …

Cited by 119 - Related articles - All 9 versions
