A survey on automated log analysis for reliability engineering

S He, P He, Z Chen, T Yang, Y Su, MR Lyu - ACM computing surveys …, 2021 - dl.acm.org
Logs are semi-structured text generated by logging statements in software source code. In
recent decades, software logs have become imperative in the reliability assurance …

AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Tools and benchmarks for automated log parsing

J Zhu, S He, J Liu, P He, Q Xie… - 2019 IEEE/ACM 41st …, 2019 - ieeexplore.ieee.org
Logs are imperative in the development and maintenance process of many software
systems. They record detailed runtime information that allows developers and support …

Log clustering based problem identification for online service systems

Q Lin, H Zhang, JG Lou, Y Zhang, X Chen - Proceedings of the 38th …, 2016 - dl.acm.org
Logs play an important role in the maintenance of large-scale online service systems. When
an online service fails, engineers need to examine recorded logs to gain insights into the …

Identifying impactful service system problems via log analysis

S He, Q Lin, JG Lou, H Zhang, MR Lyu… - Proceedings of the 2018 …, 2018 - dl.acm.org
Logs are often used for troubleshooting in large-scale software systems. For a cloud-based
online system that provides 24/7 service, a huge number of logs could be generated every …

Towards intelligent incident management: why we need it and how we make it

Z Chen, Y Kang, L Li, X Zhang, H Zhang, H Xu… - Proceedings of the 28th …, 2020 - dl.acm.org
The management of cloud service incidents (unplanned interruptions or outages of a
service/product) greatly affects customer satisfaction and business revenue. After years of …

Predicting node failures in an ultra-large-scale cloud computing platform: an AIOps solution

Y Li, ZM Jiang, H Li, AE Hassan, C He… - ACM Transactions on …, 2020 - dl.acm.org
Many software services today are hosted on cloud computing platforms, such as Amazon
EC2, due to many benefits like reduced operational costs. However, node failures in these …

Log2: A Cost-Aware logging mechanism for performance diagnosis

R Ding, H Zhou, JG Lou, H Zhang, Q Lin, Q Fu… - 2015 USENIX annual …, 2015 - usenix.org
Logging has been a common practice for monitoring and diagnosing performance issues.
However, logging comes at a cost, especially for large-scale online service systems. First …

An empirical study of the impact of data splitting decisions on the performance of AIOps solutions

Y Lyu, H Li, M Sayagh, ZM Jiang… - ACM Transactions on …, 2021 - dl.acm.org
AIOps (Artificial Intelligence for IT Operations) leverages machine learning models to help
practitioners handle the massive data produced during the operations of large-scale …

On the model update strategies for supervised learning in AIOps solutions

Y Lyu, H Li, ZM Jiang, AE Hassan - ACM Transactions on Software …, 2024 - dl.acm.org
AIOps (Artificial Intelligence for IT Operations) solutions leverage the massive data produced
during the operation of large-scale systems and machine learning models to assist software …