Real-time incident prediction for online service systems

N Zhao, J Chen, Z Wang, X Peng, G Wang… - Proceedings of the 28th …, 2020 - dl.acm.org
N Zhao, J Chen, Z Wang, X Peng, G Wang, Y Wu, F Zhou, Z Feng, X Nie, W Zhang, K Sui
Proceedings of the 28th ACM Joint Meeting on European Software Engineering …, 2020 - dl.acm.org
Incidents in online service systems can dramatically degrade system availability and destroy user experience. To guarantee service quality and reduce economic loss, it is essential to predict incidents in advance so that engineers can take proactive actions to prevent them. In this work, we propose an effective and interpretable incident prediction approach, called eWarn, which uses historical data to forecast, in real time and based on alert data, whether an incident will happen in the near future. More specifically, eWarn first extracts a set of effective features (including textual features and statistical features) to represent omen alert patterns via careful feature engineering. To reduce the influence of noisy alerts (those not relevant to the occurrence of incidents), eWarn then incorporates the multi-instance learning formulation. Finally, eWarn builds a classification model via machine learning and generates an interpretable report about the prediction result via a state-of-the-art explanation technique (i.e., LIME). In this way, an early warning signal along with its interpretable report can be sent to engineers to help them understand and handle the incoming incident. An extensive study on 11 real-world online service systems from a large commercial bank demonstrates the effectiveness of eWarn, which outperforms state-of-the-art alert-based incident prediction approaches and the practice of incident prediction with alerts. In particular, we have applied eWarn to two large commercial banks in practice and shared some success stories and lessons learned from real deployment.
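To make the multi-instance framing in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the alerts in an observation window form a "bag", that each short time slice inside the window is one "instance", and that a bag is labeled positive when an incident starts within a prediction horizon after the window. The `Alert` fields, the two statistical features, the keyword-count textual features, and the max-pooling aggregation are all illustrative assumptions; the abstract does not specify them.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point (assumed unit)
    severity: str     # e.g. "critical" or "warning" (hypothetical values)
    message: str      # free-text alert content


def instance_features(alerts, vocabulary):
    """Features for one time slice: statistical counts plus keyword counts."""
    words = Counter(w for a in alerts for w in a.message.lower().split())
    critical = sum(1 for a in alerts if a.severity == "critical")
    statistical = [len(alerts), critical]
    textual = [words.get(term, 0) for term in vocabulary]
    return statistical + textual


def build_bag(alerts, window_start, window_len, slice_len, vocabulary):
    """Split an observation window into slices; each slice is one instance."""
    bag = []
    for i in range(int(window_len // slice_len)):
        lo = window_start + i * slice_len
        hi = lo + slice_len
        in_slice = [a for a in alerts if lo <= a.timestamp < hi]
        bag.append(instance_features(in_slice, vocabulary))
    return bag


def bag_label(incident_times, window_end, horizon):
    """A bag is positive if an incident starts within the horizon after it."""
    return any(window_end <= t < window_end + horizon for t in incident_times)


def aggregate(bag):
    """Max-pool each feature over instances (a simple stand-in for MIL)."""
    return [max(column) for column in zip(*bag)]


alerts = [
    Alert(10, "critical", "disk full"),
    Alert(70, "warning", "disk latency high"),
    Alert(75, "critical", "disk full"),
]
vocabulary = ["disk", "full"]
bag = build_bag(alerts, window_start=0, window_len=120, slice_len=60,
                vocabulary=vocabulary)
bag_vector = aggregate(bag)
label = bag_label([150], window_end=120, horizon=60)
```

Because the label is attached to the bag rather than to individual alerts, slices that contain only irrelevant alerts do not need labels of their own, which is the sense in which this formulation can tolerate noise. The resulting bag-level vectors could then be passed to any standard classifier, and a per-prediction explanation could be produced with a tool such as LIME.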
ACM Digital Library