Minerva: A reinforcement learning-based technique for optimal scheduling and bottleneck detection in distributed factory operations

TE Thomas, J Koo, S Chaterji… - 2018 10th international …, 2018 - ieeexplore.ieee.org
In manufacturing systems, the term bottleneck refers to a component that limits the entire
throughput of a system. A number of approaches have attempted bottleneck detection …

Adaptive performance anomaly detection in distributed systems using online SVMs

JA Cid-Fuentes, C Szabo… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
Performance anomaly detection is crucial for long running, large scale distributed systems.
However, existing works focus on the detection of specific types of anomalies, rely on …

A conceptual framework for HPC operational data analytics

A Netti, W Shin, M Ott, T Wilde… - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
This paper provides a broad framework for understanding trends in Operational Data
Analytics (ODA) for High-Performance Computing (HPC) facilities. The goal of ODA is to …

CausIL: Causal graph for instance level microservice data

S Chakraborty, S Garg, S Agarwal, A Chauhan… - Proceedings of the …, 2023 - dl.acm.org
AI-based monitoring has become crucial for cloud-based services due to its scale. A
common approach to AI-based monitoring is to detect causal relationships among service …

Operational data analytics in practice: experiences from design to deployment in production HPC environments

A Netti, M Ott, C Guillen, D Tafani, M Schulz - Parallel Computing, 2022 - Elsevier
As HPC systems continue to grow in scale and complexity, efficient and manageable
operation is increasingly critical. For this reason, many centers are starting to explore the …

Dependency analysis of cloud applications for performance monitoring using recurrent neural networks

SY Shah, Z Yuan, S Lu, P Zerfos - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
Performance monitoring of cloud-native applications that consist of several micro-services
involves the analysis of time series data collected from the infrastructure, platform, and …

Correlation-wise smoothing: Lightweight knowledge extraction for HPC monitoring data

A Netti, D Tafani, M Ott, M Schulz - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
Modern High-Performance Computing (HPC) and data center operators rely more and more
on data analytics techniques to improve the efficiency and reliability of their operations. They …

Sirius: Neural network based probabilistic assertions for detecting silent data corruption in parallel programs

TE Thomas, AJ Bhattad, S Mitra… - 2016 IEEE 35th …, 2016 - ieeexplore.ieee.org
The size and complexity of supercomputing clusters are rapidly increasing to cater to the
needs of complex scientific applications. At the same time, the feature size and operating …

Real time learning evaluation based on gaze tracking

J Yi, B Sheng, R Shen, W Lin… - 2015 14th International …, 2015 - ieeexplore.ieee.org
In this paper, we present a system that extracts the information implied by eye movements
and uses this information to analyze students' learning behavior. Our system uses a common …

Dealing with the unknown: Resilience to prediction errors

S Mitra, G Bronevetsky, S Javagal… - … Conference on Parallel …, 2015 - ieeexplore.ieee.org
Accurate prediction of applications' performance and functional behavior is a critical
component for a wide range of tools, including anomaly detection, task scheduling and …