Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data

G Yu, P Chen, Y Li, H Chen, X Li, Z Zheng - Proceedings of the 31st …, 2023 - dl.acm.org
Root cause analysis (RCA) in large-scale microservice systems is a critical and challenging
task. To understand and localize root causes of unexpected faults, modern observability …

TraceCRL: contrastive representation learning for microservice trace analysis

C Zhang, X Peng, T Zhou, C Sha, Z Yan… - Proceedings of the 30th …, 2022 - dl.acm.org
Due to the large amount and high complexity of trace data, microservice trace analysis tasks
such as anomaly detection, fault diagnosis, and tail-based sampling widely adopt machine …

STEAM: observability-preserving trace sampling

S He, B Feng, L Li, X Zhang, Y Kang, Q Lin… - Proceedings of the 31st …, 2023 - dl.acm.org
In distributed systems and microservice applications, tracing is a crucial observability signal
employed for comprehending their internal states. To mitigate the overhead associated with …

LogReducer: Identify and reduce log hotspots in kernel on the fly

G Yu, P Chen, P Li, T Weng, H Zheng… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Modern systems generate a massive amount of logs to detect and diagnose system faults,
which incurs expensive storage costs and runtime overhead. After investigating real-world …

TraStrainer: Adaptive sampling for distributed traces with system runtime state

H Huang, X Zhang, P Chen, Z He, Z Chen… - Proceedings of the …, 2024 - dl.acm.org
Distributed tracing has been widely adopted in many microservice systems and plays an
important role in monitoring and analyzing the system. However, trace data often come in …

TraceStream: Anomalous service localization based on trace stream clustering with online feedback

T Zhou, C Zhang, X Peng, Z Yan, P Li… - 2023 IEEE 34th …, 2023 - ieeexplore.ieee.org
Modern large-scale service-based systems such as microservice systems have become
increasingly complex, making it hard to localize anomalous services when various issues …

Efficient and Robust Trace Anomaly Detection for Large-Scale Microservice Systems

S Zhang, Z Pan, H Liu, P Jin, Y Sun… - 2023 IEEE 34th …, 2023 - ieeexplore.ieee.org
Microservice invocation anomalies can have a detrimental impact on user experience and
service revenue. While existing trace anomaly detection approaches typically focus on …

SampleHST: Efficient on-the-fly selection of distributed traces

AU Gias, Y Gao, M Sheldon… - NOMS 2023-2023 …, 2023 - ieeexplore.ieee.org
Since only a small number of traces generated from distributed tracing helps in
troubleshooting, its storage requirement can be significantly reduced by biasing the …

MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications

H Chen, P Chen, G Yu, X Li, Z He… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Microservice is a widely-adopted architecture for constructing cloud-native applications. To
test application resiliency, chaos engineering is widely used to inject faults proactively in …

The Diagnosis-Effective Sampling of Application Traces

A Poghosyan, A Harutyunyan, E Davtyan… - Applied Sciences, 2024 - mdpi.com
Distributed tracing is cutting-edge technology used for monitoring, managing, and
troubleshooting native cloud applications. It offers a more comprehensive and continuous …