Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents occurring within these systems can lead to service disruptions and adversely affect user …
P Jin, S Zhang, M Ma, H Li, Y Kang, L Li, Y Liu… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud systems have become increasingly popular in recent years due to their flexibility and scalability. Each time cloud computing applications and services hosted on the cloud are …
Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - yinfangchen.github.io
Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …
Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …
In large-scale cloud service systems, monitoring metric data and conducting anomaly detection is an important way to maintain reliability and stability. However, great disparity …
Modern cloud infrastructure is powered by cluster management systems such as Kubernetes and Docker Swarm. While these systems seek to minimize users' operational burden, the …
The rapid growth in the use of Large Language Models (LLMs) and AI Agents as part of software development and deployment is revolutionizing the information technology …
Z Wang, J Li, M Ma, Z Li, Y Kang… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
Large-scale cloud services frequently experience incidents that can have a significant impact on their stability. Incident triage is a critical process that assigns incidents to …
Modern cloud infrastructure is powered by cluster management systems such as Kubernetes and Docker Swarm. While these systems seek to minimize users' operational burden, the …