The Zoltan and Isorropia parallel toolkits for combinatorial scientific computing: Partitioning, ordering and coloring

EG Boman, ÜV Çatalyürek, C Chevalier… - Scientific …, 2012 - content.iospress.com
Partitioning and load balancing are important problems in scientific computing that can be
modeled as combinatorial problems using graphs or hypergraphs. The Zoltan toolkit was …

Proactive fault tolerance using preemptive migration

C Engelmann, GR Vallee, T Naughton… - 2009 17th Euromicro …, 2009 - ieeexplore.ieee.org
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents
compute node failures from impacting running parallel applications by preemptively …

Comprehensive resource use monitoring for HPC systems with TACC stats

T Evans, WL Barth, JC Browne… - … Workshop on HPC …, 2014 - ieeexplore.ieee.org
This paper reports on a comprehensive, fully automated resource use monitoring package,
TACC Stats, which enables both consultants, users and other stakeholders in an HPC …

A review of supercomputer performance monitoring systems

KS Stefanov, S Pawar, A Ranjan… - Supercomputing …, 2021 - superfri.susu.ru
High Performance Computing is now one of the emerging fields in computer
science and its applications. Top HPC facilities, supercomputers, offer great opportunities in …

Resource monitoring and management with OVIS to enable HPC in cloud computing environments

J Brandt, A Gentile, J Mayo, P Pebay… - … on Parallel & …, 2009 - ieeexplore.ieee.org
Using the cloud computing paradigm, a host of companies promise to make huge compute
resources available to users on a pay-as-you-go basis. These resources can be configured …

Monster: an out-of-the-box monitoring tool for high performance computing systems

J Li, G Ali, N Nguyen, J Hass, A Sill… - … on Cluster Computing …, 2020 - ieeexplore.ieee.org
Understanding the status of high-performance computing platforms and correlating
applications to resource usage provide insight into the interactions among platform …

Monitoring and predicting hardware failures in HPC clusters with FTB-IPMI

R Rajachandrasekar, X Besseron… - 2012 IEEE 26th …, 2012 - ieeexplore.ieee.org
Fault-detection and prediction in HPC clusters and Cloud-computing systems are
increasingly challenging issues. Several system middleware such as job schedulers and …

Window-based, discontinuity preserving stereo

M Agrawal, LS Davis - Proceedings of the 2004 IEEE Computer …, 2004 - ieeexplore.ieee.org
Traditionally, the problem of stereo matching has been addressed either by a local
window-based approach or a dense pixel-based approach using global optimization. In this paper …

Extending LDMS to enable performance monitoring in multi-core applications

S Feldman, D Zhang, D Dechev… - 2015 IEEE International …, 2015 - ieeexplore.ieee.org
Identifying design patterns that limit the performance of multi-core algorithms is a
challenging task. There are many known methods by which threads synchronize their …

Understanding application and system performance through system-wide monitoring

RT Evans, JC Browne, WL Barth - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
TACC Stats is a continuous monitoring tool for HPC systems that collects data at the core
and process level for every job executing on a monitored system. That data can be …