Continuous incident triage for large-scale online service systems

J Chen, X He, Q Lin, H Zhang, D Hao… - 2019 34th IEEE/ACM …, 2019 - ieeexplore.ieee.org
In recent years, online service systems have become increasingly popular. Incidents of
these systems could cause significant economic loss and customer dissatisfaction. Incident …

An empirical investigation of incident triage for online service systems

J Chen, X He, Q Lin, Y Xu, H Zhang… - 2019 IEEE/ACM 41st …, 2019 - ieeexplore.ieee.org
Online service systems have become increasingly popular. During operation of an online
service system, incidents (unplanned interruptions or outages of the service) are inevitable …

How incidental are the incidents? Characterizing and prioritizing incidents for large-scale online service systems

J Chen, S Zhang, X He, Q Lin, H Zhang, D Hao… - Proceedings of the 35th …, 2020 - dl.acm.org
Although tremendous efforts have been devoted to the quality assurance of online service
systems, in reality, these systems still come across many incidents (i.e., unplanned …

Predicting node failures in an ultra-large-scale cloud computing platform: an AIOps solution

Y Li, ZM Jiang, H Li, AE Hassan, C He… - ACM Transactions on …, 2020 - dl.acm.org
Many software services today are hosted on cloud computing platforms, such as Amazon
EC2, due to many benefits like reduced operational costs. However, node failures in these …

Real-time incident prediction for online service systems

N Zhao, J Chen, Z Wang, X Peng, G Wang… - Proceedings of the 28th …, 2020 - dl.acm.org
Incidents in online service systems could dramatically degrade system availability and
destroy user experience. To guarantee service quality and reduce economic loss, it is …

An empirical study of the impact of data splitting decisions on the performance of AIOps solutions

Y Lyu, H Li, M Sayagh, ZM Jiang… - ACM Transactions on …, 2021 - dl.acm.org
AIOps (Artificial Intelligence for IT Operations) leverages machine learning models to help
practitioners handle the massive data produced during the operations of large-scale …

On the model update strategies for supervised learning in AIOps solutions

Y Lyu, H Li, ZM Jiang, AE Hassan - ACM Transactions on Software …, 2024 - dl.acm.org
AIOps (Artificial Intelligence for IT Operations) solutions leverage the massive data produced
during the operation of large-scale systems and machine learning models to assist software …

A survey on intelligent management of alerts and incidents in IT services

Q Yu, N Zhao, M Li, Z Li, H Wang, W Zhang… - Journal of Network and …, 2024 - Elsevier
Modern service systems are constantly improving with the development of various IT
technologies, leading to a boost in system scales and complex dependencies among …

Fighting the fog of war: Automated incident detection for cloud systems

L Li, X Zhang, X Zhao, H Zhang, Y Kang… - 2021 USENIX Annual …, 2021 - usenix.org
Incidents and outages dramatically degrade the availability of large-scale cloud computing
systems such as AWS, Azure, and GCP. In current incident response practice, each team …

Constructing large-scale real-world benchmark datasets for AIOps

Z Li, N Zhao, S Zhang, Y Sun, P Chen, X Wen… - arXiv preprint arXiv …, 2022 - arxiv.org
Recently, AIOps (Artificial Intelligence for IT Operations) has been well studied in academia
and industry to enable automated and effective software service management. Plenty of …