ImDiffusion: Imputed diffusion models for multivariate time series anomaly detection

Y Chen, C Zhang, M Ma, Y Liu, R Ding, B Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Anomaly detection in multivariate time series data is of paramount importance for ensuring
the efficient operation of large-scale systems across diverse domains. However, accurately …

Xpert: Empowering incident management with query recommendations via large language models

Y Jiang, C Zhang, S He, Z Yang, M Ma, S Qin… - Proceedings of the …, 2024 - dl.acm.org
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents
occurring within these systems can lead to service disruptions and adversely affect user …

Assess and summarize: Improve outage understanding with large language models

P Jin, S Zhang, M Ma, H Li, Y Kang, L Li, Y Liu… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud systems have become increasingly popular in recent years due to their flexibility and
scalability. Each time cloud computing applications and services hosted on the cloud are …

Empowering practical root cause analysis by large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao… - arXiv preprint arXiv …, 2023 - yinfangchen.github.io
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

MonitorAssistant: Simplifying cloud service monitoring via large language models

Z Yu, M Ma, C Zhang, S Qin, Y Kang, C Bansal… - … Proceedings of the …, 2024 - dl.acm.org
In large-scale cloud service systems, monitoring metric data and conducting anomaly
detection is an important way to maintain reliability and stability. However, great disparity …

Kivi: Verification for Cluster Management

B Liu, G Lim, R Beckett, PB Godfrey - 2024 USENIX Annual Technical …, 2024 - usenix.org
Modern cloud infrastructure is powered by cluster management systems such as Kubernetes
and Docker Swarm. While these systems seek to minimize users' operational burden, the …

Building AI Agents for Autonomous Clouds: Challenges and Design Principles

M Shetty, Y Chen, G Somashekar, M Ma… - Proceedings of the …, 2024 - dl.acm.org
The rapid growth in the use of Large Language Models (LLMs) and AI Agents as part of
software development and deployment is revolutionizing the information technology …

Large Language Models Can Provide Accurate and Interpretable Incident Triage

Z Wang, J Li, M Ma, Z Li, Y Kang… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
Large-scale cloud services frequently experience incidents that can have a significant
impact on their stability. Incident triage is a critical process that assigns incidents to …

Kivi: Verification for Cluster Management

B Liu, G Lim, R Beckett, P Godfrey - arXiv preprint arXiv:2311.02800, 2023 - arxiv.org
Modern cloud infrastructure is powered by cluster management systems such as Kubernetes
and Docker Swarm. While these systems seek to minimize users' operational burden, the …