Intelligent colocation of HPC workloads

FV Zacarias, V Petrucci, R Nishtala, P Carpenter… - Journal of Parallel and …, 2021 - Elsevier
Many HPC applications suffer from a bottleneck in the shared caches, instruction execution
units, I/O or memory bandwidth, even though the remaining resources may be underutilized …

Exploring job running path to predict runtime on multiple production supercomputers

W Yang, X Liao, D Dong, J Yu - Journal of Parallel and Distributed …, 2023 - Elsevier
There are massive jobs submitted in the supercomputer, and the job management system is
typically deployed to schedule these jobs and allocate compute resources. FCFS (First …

Intelligent colocation of workloads for enhanced server efficiency

FV Zacarias, V Petrucci, R Nishtala… - 2019 31st …, 2019 - ieeexplore.ieee.org
Many server applications achieve only a fraction of their theoretical peak performance due to
bottlenecks in the shared caches, instruction execution units, I/O or memory bandwidth, even …

Quantifying server memory frequency margin and using it to improve performance in HPC systems

D Zhang, G Panwar, JB Kotra… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
To maintain strong reliability, memory manufacturers label server memories at much slower
data rates than the highest data rates at which they can still operate correctly for most (e.g. …

Exploring the tradeoff between reliability and performance in HPC systems

C Walker, B Slade, G Bailey… - 2021 IEEE High …, 2021 - ieeexplore.ieee.org
Evaluating the trade-off space between performance and reliability is important for data
center operators as part of their supercomputer procurement, planning and acceptance …

Optimized hardware configuration for high performance computing systems

S Hutchison, D Andresen, W Hsu… - Proceedings of the …, 2023 - personales.upv.es
When faced with upgrading or replacing High Performance Computing or High Throughput
Computing systems, system administrators can be overwhelmed by hardware options …

Developing accurate Slurm simulator

NA Simakov, RL Deleon, Y Lin, PS Hoffmann… - … and Experience in …, 2022 - dl.acm.org
A new Slurm simulator compatible with the latest Slurm version has been produced. It was
constructed by systematically transforming the Slurm code step by step to maintain the …

Optimizing the Resource and Job Management System of an Academic HPC & Research Computing Facility

S Varrette, E Kieffer, F Pinel - 2022 21st International …, 2022 - ieeexplore.ieee.org
High Performance Computing (HPC) is nowadays a strategic asset required to sustain the
surging demands for massive processing and data-analytic capabilities. In practice, the …

A resourceful coordination approach for multilevel scheduling

A Eleliemy, FM Ciorba - arXiv preprint arXiv:2103.05809, 2021 - arxiv.org
HPC users aim to improve their execution times without particular regard for increasing
system utilization. On the contrary, HPC operators favor increasing the number of executed …

DeletePop: A DLT Execution Time Predictor Based on Comprehensive Modeling

Y He, Y Zhou, E Shao, G Tan, N Sun - International Conference on …, 2023 - Springer
The modeling and simulation of Deep Learning Training (DLT) are challenging problems.
Due to the intricate parallel patterns, existing modelings and simulations do not consider …