A Survey on Failure Analysis and Fault Injection in AI Systems

G Yu, G Tan, H Huang, Z Zhang, P Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Artificial Intelligence (AI) has led to its integration into various
areas, especially with Large Language Models (LLMs) significantly enhancing capabilities …

Chaos Engineering: A Multi-Vocal Literature Review

J Owotogbe, I Kumara, WJVD Heuvel… - arXiv preprint arXiv …, 2024 - arxiv.org
Organizations, particularly medium and large enterprises, typically today rely heavily on
complex, distributed systems to deliver critical services and products. However, the growing …

DMSA: A Decentralized Microservice Architecture for Edge Networks

Y Chen, C Lu, Y Huang, C Wu, F Guo, H Lu… - arXiv preprint arXiv …, 2025 - arxiv.org
The dispersed node locations and complex topologies of edge networks, combined with
intricate dynamic microservice dependencies, render traditional centralized microservice …

Real-time and Downtime-tolerant Fault Diagnosis for Railway Turnout Machines (RTMs) Empowered with Cloud-Edge Pipeline Parallelism

F Wu, M Bilal, H Xiang, H Wang, J Yu, X Xu - arXiv preprint arXiv …, 2024 - arxiv.org
Railway Turnout Machines (RTMs) are mission-critical components of the railway
transportation infrastructure, responsible for directing trains onto desired tracks. For safety …

Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis

C Pei, Z Wang, F Liu, Z Li, Y Liu, X He, R Kang… - openreview.net
In the realm of microservices architecture, the occurrence of frequent incidents necessitates
the employment of Root Cause Analysis (RCA) for swift issue resolution. It is common that a …

[CITATION] ChaosEater: Fully Automating Chaos Engineering with Large Language Models

D Kikuta, H Ikeuchi, K Tajiri, Y Nakano - arXiv preprint arXiv:2501.11107, 2025