A comprehensive perspective on pilot-job systems

M Turilli, M Santcroos, S Jha - ACM Computing Surveys (CSUR), 2018 - dl.acm.org
Pilot-Job systems play an important role in supporting distributed scientific computing. They
are used to execute millions of jobs on several cyberinfrastructures worldwide, consuming …

Open XDMoD: A tool for the comprehensive management of high-performance computing resources

JT Palmer, SM Gallo, TR Furlani… - … in Science & …, 2015 - ieeexplore.ieee.org
Open XDMoD is an open source tool designed to facilitate the management of
high-performance computing (HPC) systems. The Open XDMoD portal provides a rich set of …

Deep analysis of job state statistics on Lomonosov-2 supercomputer

DA Nikitenko, VV Voevodin, SA Zhumatiy - … Frontiers and Innovations, 2018 - superfri.org
It is common knowledge that the increasingly growing capabilities of HPC systems are
always limited by a number of efficiency related issues. The reasons can be very different …

First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads

NA Simakov, MD Jones, TR Furlani… - Proceedings of the …, 2024 - dl.acm.org
The engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper
Superchips were tested using different benchmarks and scientific applications. The …

Understanding application and system performance through system-wide monitoring

RT Evans, JC Browne, WL Barth - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
TACC Stats is a continuous monitoring tool for HPC systems that collects data at the core
and process level for every job executing on a monitored system. That data can be …

Are we ready for broader adoption of ARM in the HPC community: Performance and Energy Efficiency Analysis of Benchmarks and Applications Executed on High …

NA Simakov, RL Deleon, JP White, MD Jones… - Proceedings of the …, 2023 - dl.acm.org
A set of benchmarks, including numerical libraries and real-world scientific applications,
were run on several modern ARM systems (Amazon Graviton 3/2, Fujitsu A64FX, Ampere …

Analysis of XDMoD/SUPReMM data using machine learning techniques

SM Gallo, JP White, RL DeLeon… - 2015 IEEE …, 2015 - ieeexplore.ieee.org
Machine learning techniques were applied to job accounting and performance data for
application classification. Job data were accumulated using the XDMoD monitoring …

Comprehensive, open‐source resource usage measurement and analysis for HPC systems

JC Browne, RL DeLeon, AK Patra… - Concurrency and …, 2014 - Wiley Online Library
The important role high‐performance computing (HPC) resources play in science and
engineering research, coupled with its high cost (capital, power and manpower), short life …

Integrating abstractions to enhance the execution of distributed applications

M Turilli, F Liu, Z Zhang, A Merzky… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
One of the factors that limits the scale, performance, and sophistication of distributed
applications is the difficulty of concurrently executing them on multiple distributed computing …

Application kernels: HPC resources performance monitoring and variance analysis

NA Simakov, JP White, RL DeLeon… - Concurrency and …, 2015 - Wiley Online Library
Application kernels are computationally lightweight benchmarks or applications run
repeatedly on high performance computing (HPC) clusters in order to track the Quality of …