ServerRCA: Root Cause Analysis for Server Failure using Operating System Logs

J Shi, S Jiang, B Xu, Y Xiao - 2023 IEEE 34th International …, 2023 - ieeexplore.ieee.org
The development of the information technology industry has made servers an essential
infrastructure for enterprises. Server failure may result in significant economic losses …

Software and Infrastructure Log-Based Framework for Identifying the Causes of System Faults

N Hanakawa, M Obana - 2018 25th Asia-Pacific Software …, 2018 - ieeexplore.ieee.org
Recently, computer systems have become increasingly complex, involving a wide range of
infrastructure and software technologies in the same system. Because these complex …

LogExpert: Log-based Recommended Resolutions Generation using Large Language Model

J Wang, G Chu, J Wang, H Sun, Q Qi, Y Wang… - Proceedings of the …, 2024 - dl.acm.org
Software logs play a vital role in ensuring the reliability and availability of large-scale
software systems. In recent years, researchers have made significant efforts to build log …

FluxRank: A widely-deployable framework to automatically localizing root cause machines for software service failure mitigation

P Liu, Y Chen, X Nie, J Zhu, S Zhang… - 2019 IEEE 30th …, 2019 - ieeexplore.ieee.org
The failures of software service directly affect user experiences and service revenue. Thus
operators monitor both service-level KPIs (e.g., response time) and machine-level KPIs (e.g. …

Detection of software failures through event logs: An experimental study

A Pecchia, S Russo - 2012 IEEE 23rd International Symposium …, 2012 - ieeexplore.ieee.org
Software faults are recognized to be among the main responsible for system failures in many
application domains. Event logs play a key role to support the analysis of failures occurring …

Log clustering based problem identification for online service systems

Q Lin, H Zhang, JG Lou, Y Zhang, X Chen - Proceedings of the 38th …, 2016 - dl.acm.org
Logs play an important role in the maintenance of large-scale online service systems. When
an online service fails, engineers need to examine recorded logs to gain insights into the …

ESRO: Experience Assisted Service Reliability against Outages

S Chakraborty, S Agarwal, S Garg… - 2023 38th IEEE/ACM …, 2023 - ieeexplore.ieee.org
Modern cloud services are prone to failures due to their complex architecture, making
diagnosis a critical process. Site Reliability Engineers (SREs) spend hours leveraging …

Machine deserves better logging: a log enhancement approach for automatic fault diagnosis

T Jia, Y Li, C Zhang, W Xia, J Jiang… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
When systems fail, log data is often the most important information source for fault diagnosis.
However, the performance of automatic fault diagnosis is limited by the ad-hoc nature of …

FlowRCA: Enhancing Microservice Reliability with Non-invasive Root Cause Analysis

Z Wu, J Wang, Q Qi, MG Shu, R Chu… - … Conference on Web …, 2024 - ieeexplore.ieee.org
Microservice architectures, characterized by their loosely coupled services and complex call
patterns, have become predominant in cloud applications, benefiting from elastic scalability …

Enhancing HPC system log analysis by identifying message origin in source code

M Hickman, D Fulp, E Baseman… - 2018 IEEE …, 2018 - ieeexplore.ieee.org
Supercomputers, high performance computers, and clusters are composed of very large
numbers of independent operating systems that are generating their own system logs …