A Survey on Failure Analysis and Fault Injection in AI Systems

G Yu, G Tan, H Huang, Z Zhang, P Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various
areas, especially with Large Language Models (LLMs) significantly enhancing capabilities …

Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data

G Yu, P Chen, Y Li, H Chen, X Li, Z Zheng - Proceedings of the 31st …, 2023 - dl.acm.org
Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging
task. To understand and localize root causes of unexpected faults, modern observability …

LogShrink: Effective Log Compression by Leveraging Commonality and Variability of Log Data

X Li, H Zhang, VH Le, P Chen - Proceedings of the 46th IEEE/ACM …, 2024 - dl.acm.org
Log data is a crucial resource for recording system events and states during system
execution. However, as systems grow in scale, log data generation has become increasingly …

ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems

G Yu, P Chen, Z He, Q Yan, Y Luo, F Li… - Proceedings of the ACM …, 2024 - dl.acm.org
In large-scale online service systems, the occurrence of software changes is inevitable and
frequent. Despite rigorous pre-deployment testing practices, the presence of defective …

Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis

J Huang, Z Jiang, J Liu, Y Huo, J Gu… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
Logs are imperative in the maintenance of online service systems, which often encompass
important information for effective failure mitigation. While existing anomaly detection …

How to Manage Change-Induced Incidents? Lessons from the Study of Incident Life Cycle

Y Zhao, L Jiang, Y Tao, S Zhang, C Wu… - 2023 IEEE 34th …, 2023 - ieeexplore.ieee.org
In online service systems, software changes cause a majority of incidents (i.e., unplanned
interruptions and outages). Managing change-induced incidents efficiently is crucial for …

FC: Adaptive Atomic Commit via Failure Detection

H Pan, QT Ta, M Zhang, Z Zhao… - 2024 IEEE 40th …, 2024 - ieeexplore.ieee.org
Atomic commit protocols (ACPs) are crucial for ensuring transaction atomicity in distributed
transaction processing. However, existing ACPs, designed specifically for fixed failure …

DeployFix: Dynamic Repair of Software Deployment Failures via Constraint Solving

H Liao, J Guo, B Huang, Y Han, D Yang, K Shi… - Proceedings of the 39th …, 2024 - dl.acm.org
Software deployment misconfiguration often happens and has been one of the major causes
of deployment failures that give rise to service interruptions. However, there is currently no …

MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications

H Chen, P Chen, G Yu, X Li, Z He… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Microservice is a widely-adopted architecture for constructing cloud-native applications. To
test application resiliency, chaos engineering is widely used to inject faults proactively in …

MetricSifter: Feature Reduction of Multivariate Time Series Data for Efficient Fault Localization in Cloud Applications

Y Tsubouchi, H Tsuruta - IEEE Access, 2024 - ieeexplore.ieee.org
Automated fault localization in large-scale cloud-based applications is challenging because
it involves mining multivariate time series data from large volumes of operational monitoring …