A review on the decarbonization of high-performance computing centers

CA Silva, R Vilaça, A Pereira, RJ Bessa - Renewable and Sustainable …, 2024 - Elsevier
High-performance computing relies on performance-oriented infrastructures with access to
powerful computing resources to complete tasks that contribute to solve complex problems …

Miso: exploiting multi-instance GPU capability on multi-tenant GPU clusters

B Li, T Patel, S Samsi, V Gadepally… - Proceedings of the 13th …, 2022 - dl.acm.org
GPU technology has been improving at an expedited pace in terms of size and performance,
empowering HPC and AI/ML researchers to advance the scientific discovery process …

A digital twin framework for liquid-cooled supercomputers as demonstrated at exascale

W Brewer, M Maiterth, V Kumar… - … Conference for High …, 2024 - ieeexplore.ieee.org
We present ExaDigiT, an open-source framework for developing comprehensive digital
twins of liquid-cooled supercomputers. It integrates three main modules: (1) a resource …

RUAD: Unsupervised anomaly detection in HPC systems

M Molan, A Borghesi, D Cesarini, L Benini… - Future Generation …, 2023 - Elsevier
The increasing complexity of modern high-performance computing (HPC) systems
necessitates the introduction of automated and data-driven methodologies to support system …

Clover: Toward sustainable AI with carbon-aware machine learning inference service

B Li, S Samsi, V Gadepally, D Tiwari - Proceedings of the International …, 2023 - dl.acm.org
This paper presents a solution to the challenge of mitigating carbon emissions from hosting
large-scale machine learning (ML) inference services. ML inference is critical to modern …

Precise energy consumption measurements of heterogeneous artificial intelligence workloads

R Caspart, S Ziegler, A Weyrauch, H Obermaier… - … Conference on High …, 2022 - Springer
With the rise of artificial intelligence (AI) in recent years and the subsequent increase in
complexity of the applied models, the growing demand in computational resources is …

Power profile monitoring and tracking evolution of system-wide HPC workloads

AM Karimi, NS Sattar, W Shin… - 2024 IEEE 44th …, 2024 - ieeexplore.ieee.org
The power & energy demands of HPC machines have grown significantly. Modern exascale
HPC systems require tens of megawatts of combined power for computing resources and …

Graph neural networks for anomaly anticipation in HPC systems

M Molan, J Ahmed Khan, A Borghesi… - Companion of the 2023 …, 2023 - dl.acm.org
In this paper, we explore the use of Graph Neural Networks (GNNs) for anomaly anticipation
in high performance computing (HPC) systems. We propose a GNN-based approach that …

Towards scalable resource management for supercomputers

Y Dai, Y Dong, K Lu, R Wang, W Zhang… - … Conference for High …, 2022 - ieeexplore.ieee.org
Today's supercomputers offer massive computation resources to execute a large number of
user jobs. Effectively managing such large-scale hardware parallelism and workloads is …

Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis

X Chu, D Hofstätter, S Ilager, S Talluri… - 2024 IEEE 30th …, 2024 - ieeexplore.ieee.org
HPC datacenters offer a backbone to the modern digital society. Increasingly, they run
Machine Learning (ML) jobs next to generic, compute-intensive workloads, supporting …