Recent advances in fault localization in computer networks

A Dusia, AS Sethi - IEEE Communications Surveys & Tutorials, 2016 - ieeexplore.ieee.org
Fault localization, a core element in network fault management, is the process of inferring
the exact failure in a network from the set of observed symptoms. Since faults in network …

Causal modeling based fault localization in cloud systems using golden signals

P Aggarwal, S Nagar, A Gupta… - 2021 IEEE 14th …, 2021 - ieeexplore.ieee.org
In cloud-native applications, a large fraction of operational failures, known as outages, result
in violations of Service Level Objectives (SLOs). SLOs are defined around specific …

Performance degradation root cause prediction in a distributed computing system

MK Agarwal, G Kar, A Neogi, A Sailer - US Patent 7,412,448, 2008 - Google Patents
A method of identifying at least one resource in a distributed computing system which is a
potential root cause of performance degradation of the system includes the steps of …

Problem determination in enterprise middleware systems using change point correlation of time series data

MK Agarwal, M Gupta, V Mann… - 2006 IEEE/IFIP …, 2006 - ieeexplore.ieee.org
Clustered enterprise middleware systems employing dynamic workload scheduling are
susceptible to a variety of application malfunctions that can manifest themselves in a …

Leveraging many simple statistical models to adaptively monitor software systems

MA Munawar, PAS Ward - … Symposium, ISPA 2007 Niagara Falls, Canada …, 2007 - Springer
Self-managing systems require continuous monitoring to ensure correct operation. Detailed
monitoring is often too costly to use in production. An alternative is adaptive monitoring …

Application of adaptive probing for fault diagnosis in computer networks

M Natu, AS Sethi - NOMS 2008-2008 IEEE Network Operations …, 2008 - ieeexplore.ieee.org
This dissertation presents an adaptive probing based tool for fault diagnosis in computer
networks by addressing the problems of probe station selection and probe selection. We first …

Efficient control of false negative and false positive errors with separate adaptive thresholds

D Breitgand, M Goldstein, E Henis… - IEEE Transactions on …, 2011 - ieeexplore.ieee.org
Component level performance thresholds are widely used as a basic means for
performance management. As the complexity of managed applications increases, manual …

Performance degradation root cause prediction in a distributed computing system

MK Agarwal, G Kar, A Neogi, A Sailer - US Patent 8,161,058, 2012 - Google Patents
A method of identifying at least one resource in a distributed computing
system which is a potential root cause of performance degradation of the system includes …

Detection and diagnosis of recurrent faults in software systems by invariant analysis

M Jiang, MA Munawar, T Reidemeister… - 2008 11th IEEE High …, 2008 - ieeexplore.ieee.org
A correctly functioning enterprise-software system exhibits long-term, stable correlations
between many of its monitoring metrics. Some of these correlations no longer hold when …

Hardware performance counter-based problem diagnosis for e-commerce systems

KA Bare, S Kavulya… - 2010 IEEE Network …, 2010 - ieeexplore.ieee.org
Black-box instrumentation can support problem diagnosis in distributed systems without the
need to modify the application code or to understand its semantics. We explore a novel, low …