Fault injection analytics: A novel approach to discover failure modes in cloud-computing systems

D Cotroneo, L De Simone, P Liguori… - IEEE transactions on …, 2020 - ieeexplore.ieee.org
Cloud computing systems fail in complex and unexpected ways due to unexpected
combinations of events and interactions between hardware and software components. Fault …

Enhancing failure propagation analysis in cloud computing systems

D Cotroneo, L De Simone, P Liguori… - 2019 IEEE 30th …, 2019 - ieeexplore.ieee.org
In order to plan for failure recovery, the designers of cloud systems need to understand how
their system can potentially fail. Unfortunately, analyzing the failure behavior of such …

An approach to cloud execution failure diagnosis based on exception logs in OpenStack

Y Yuan, W Shi, B Liang, B Qin - 2019 IEEE 12th International …, 2019 - ieeexplore.ieee.org
Cloud is getting ubiquitous and scales up rapidly. It is critical to effectively detect and
efficiently repair system anomalies for a robust cloud. Many efforts have been made to …

Understanding, detecting and localizing partial failures in large system software

C Lou, P Huang, S Smith - 17th USENIX Symposium on Networked …, 2020 - usenix.org
Partial failures occur frequently in cloud systems and can cause serious damage including
inconsistency and data loss. Unfortunately, these failures are not well understood. Nor can …

Run-time failure detection via non-intrusive event analysis in a large-scale cloud computing platform

D Cotroneo, L De Simone, P Liguori… - Journal of Systems and …, 2023 - Elsevier
Cloud computing systems fail in complex and unforeseen ways due to unexpected
combinations of events and interactions among hardware and software components. These …

PerfCompass: Online performance anomaly fault localization and inference in infrastructure-as-a-service clouds

DJ Dean, H Nguyen, P Wang, X Gu… - … on Parallel and …, 2015 - ieeexplore.ieee.org
Infrastructure-as-a-service clouds are becoming widely adopted. However, resource sharing
and multi-tenancy have made performance anomalies a top concern for users. Timely …

Enhancing the analysis of software failures in cloud computing systems with deep learning

D Cotroneo, L De Simone, P Liguori… - Journal of Systems and …, 2021 - Elsevier
Identifying the failure modes of cloud computing systems is a difficult and time-consuming
task, due to the growing complexity of such systems, and the large volume and noisiness of …

Predicting failures in multi-tier distributed systems

L Mariani, M Pezzè, O Riganelli, R Xin - Journal of Systems and Software, 2020 - Elsevier
Many applications are implemented as multi-tier software systems, and are executed on
distributed infrastructures, like cloud infrastructures, to benefit from the cost reduction that …

Detection is better than cure: A cloud incidents perspective

V Ganatra, A Parayil, S Ghosh, Y Kang, M Ma… - Proceedings of the 31st …, 2023 - dl.acm.org
Cloud providers use automated watchdogs or monitors to continuously observe service
availability and to proactively report incidents when system performance degrades. Improper …

Predicting cloud-native application failures based on monitoring data of cloud infrastructure

L Toka, G Dobreff, D Haja… - 2021 IFIP/IEEE …, 2021 - ieeexplore.ieee.org
The quality of service provided by cloud-deployed online applications is often affected by
faults in the underlying cloud platform and infrastructure. In order to discover the cause and …