CauseInfer: Automated End-to-End Performance Diagnosis with Hierarchical Causality Graph in Cloud Environment

P Chen, Y Qi, D Hou - IEEE Transactions on Services Computing, 2016 - ieeexplore.ieee.org
Modern computing systems, especially cloud-based and cloud-centric ones, typically consist of a large number of components running in large distributed environments with complicated interactions. They are vulnerable to performance problems caused by highly dynamic changes in the runtime environment (e.g., overload and resource contention) or by software bugs (e.g., memory leaks). Unfortunately, it is notoriously difficult to diagnose the root causes of these performance problems at a fine granularity because of the complicated interactions and the large cardinality of the set of potential causes. In this paper, we build an automated, black-box, end-to-end cause inference system named CauseInfer to pinpoint the root causes or at least provide some hints. CauseInfer automatically maps a distributed system to a two-layer hierarchical causality graph and infers the root causes along the causal paths in that graph. CauseInfer models the fault propagation paths explicitly and works without instrumenting the running production system, which makes it more effective and practical than previous approaches. Experimental evaluations on two benchmark systems show that CauseInfer identifies the root causes with high accuracy. Compared to several state-of-the-art approaches, CauseInfer achieves an improvement of over 10 percent. Moreover, CauseInfer is lightweight and flexible enough to readily scale out in large distributed systems. With CauseInfer, the mean time to recovery (MTTR) of cloud systems can be significantly reduced.
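The idea of inferring root causes along causal paths in a two-layer graph can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not say how the graph is built or how anomalies are detected, so the sketch assumes both are given. The service names, metric names, the `SERVICE_CAUSES` and `METRIC_CAUSES` tables, and the functions `find_roots` and `infer_root_causes` are all hypothetical. It walks from the symptomatic front-end service through anomalous services first, then repeats the walk over the metrics of each suspect service.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical service layer: each service maps to the services that may cause its problems.
SERVICE_CAUSES: Dict[str, List[str]] = {
    "frontend": ["app"],
    "app": ["db"],
    "db": [],
}

# Hypothetical metric layer, one graph per service: each metric maps to its candidate causes.
# "latency" is assumed to be the symptom metric of every service.
METRIC_CAUSES: Dict[str, Dict[str, List[str]]] = {
    "frontend": {"latency": ["cpu"], "cpu": []},
    "app": {"latency": ["cpu", "memory"], "cpu": [], "memory": []},
    "db": {"latency": ["disk_io", "memory"], "disk_io": [], "memory": []},
}


def find_roots(causes: Dict[str, List[str]], start: str,
               is_anomalous: Callable[[str], bool]) -> List[str]:
    """Depth-first walk from `start` along causal edges, following only anomalous
    causes. A visited node with no anomalous cause is reported as a root."""
    roots: List[str] = []
    seen: Set[str] = set()

    def visit(node: str) -> None:
        if node in seen:
            return
        seen.add(node)
        anomalous_causes = [c for c in causes.get(node, []) if is_anomalous(c)]
        if not anomalous_causes:
            roots.append(node)
        for cause in anomalous_causes:
            visit(cause)

    visit(start)
    return roots


def infer_root_causes(front_service: str,
                      anomalous_services: Set[str],
                      anomalous_metrics: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (service, metric) pairs suspected as root causes."""
    if front_service not in anomalous_services:
        return []  # no symptom observed, nothing to diagnose
    # Layer 1: locate the suspect services.
    services = find_roots(SERVICE_CAUSES, front_service,
                          lambda s: s in anomalous_services)
    results: List[Tuple[str, str]] = []
    # Layer 2: locate the suspect metrics inside each suspect service.
    for svc in services:
        bad = anomalous_metrics.get(svc, set())
        for metric in find_roots(METRIC_CAUSES[svc], "latency", lambda m: m in bad):
            results.append((svc, metric))
    return results


if __name__ == "__main__":
    print(infer_root_causes(
        "frontend",
        {"frontend", "app", "db"},
        {"db": {"latency", "disk_io"}},
    ))
```

When no anomalous cause metric is found inside a suspect service, the sketch falls back to reporting the symptom metric itself, which corresponds to providing a hint in place of a pinpointed cause.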