AIOps solutions for incident management: Technical guidelines and a comprehensive literature review

Y Remil, A Bendimerad, R Mathonat… - arXiv preprint arXiv …, 2024 - arxiv.org
The management of modern IT systems poses unique challenges, necessitating scalability,
reliability, and efficiency in handling extensive data streams. Traditional methods, reliant on …

ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems

G Yu, P Chen, Z He, Q Yan, Y Luo, F Li… - Proceedings of the ACM …, 2024 - dl.acm.org
In large-scale online service systems, the occurrence of software changes is inevitable and
frequent. Despite rigorous pre-deployment testing practices, the presence of defective …

Detection Latencies of Anomaly Detectors: An Overlooked Perspective?

T Puccetti, A Ceccarelli - arXiv preprint arXiv:2402.09082, 2024 - arxiv.org
The ever-evolving landscape of attacks, coupled with the growing complexity of ICT systems,
makes crafting anomaly-based intrusion detectors (ID) and error detectors (ED) a difficult …

On the Difficulty of Identifying Incident-Inducing Changes

E Kapel, L Cruz, D Spinellis… - Proceedings of the 46th …, 2024 - dl.acm.org
Effective change management is crucial for businesses heavily reliant on software and
services to minimise incidents induced by changes. Unfortunately, in practice it is often …

Guardian of the Resiliency: Detecting Erroneous Software Changes Before They Make Your Microservice System Less Fault-Resilient

G He, X Nie, R Tang, K Wang, Z Yu, X Wen, K Yin… - netman.aiops.org
The microservice system's resilience is crucial for ensuring the quality of service. Nowadays,
software changes are frequent and error-prone, and erroneous software changes could …