Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system

C Yang, T Kurth, S Williams - Concurrency and Computation …, 2020 - Wiley Online Library
The Roofline performance model provides an intuitive and insightful approach to identifying
performance bottlenecks and guiding performance optimization. In preparation for the next …

A survey of software techniques to emulate heterogeneous memory systems in high-performance computing

C Foyer, B Goglin, AR Proaño - Parallel Computing, 2023 - Elsevier
Heterogeneous memory will be involved in several upcoming platforms on the way to
exascale. Combining technologies such as HBM, DRAM and/or NVDIMM allows to tackle …

Exploring the performance benefit of hybrid memory system on HPC environments

IB Peng, R Gioiosa, G Kestor, P Cicotti… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
Hardware accelerators have become a de-facto standard to achieve high performance on
current supercomputers and there are indications that this trend will increase in the future …

A survey on evaluating and optimizing performance of Intel Xeon Phi

S Mittal - Concurrency and Computation: Practice and …, 2020 - Wiley Online Library
Intel's Xeon Phi combines the parallel processing power of a many‐core
accelerator with the programming ease of CPUs. In this paper, we present a survey of works …

H2M: exploiting heterogeneous shared memory architectures

J Klinkenberg, A Kozhokanova, C Terboven… - Future Generation …, 2023 - Elsevier
Over the past decades, the performance gap between the memory subsystem and compute
capabilities continued to spread. However, scientific applications and simulations show …

10 years later: Cloud computing is closing the performance gap

G Guidi, M Ellis, A Buluç, K Yelick, D Culler - Companion of the ACM …, 2021 - dl.acm.org
Can cloud computing infrastructures provide HPC-competitive performance for scientific
applications broadly? Despite prolific related literature, this question remains open. Answers …

Vectorized parallel sparse matrix-vector multiplication in PETSc using AVX-512

H Zhang, RT Mills, K Rupp, BF Smith - Proceedings of the 47th …, 2018 - dl.acm.org
Emerging many-core CPU architectures with high degrees of single-instruction, multiple
data (SIMD) parallelism promise to enable increasingly ambitious simulations based on …

Characterizing the performance benefit of hybrid memory system for HPC applications

IB Peng, R Gioiosa, G Kestor, JS Vetter, P Cicotti… - Parallel Computing, 2018 - Elsevier
Heterogeneous memory systems that consist of multiple memory technologies are becoming
common in high-performance computing environments. Modern processors and …

Evaluation of HPC application I/O on object storage systems

J Liu, Q Koziol, GF Butler, N Fortner… - 2018 IEEE/ACM 3rd …, 2018 - ieeexplore.ieee.org
POSIX-based parallel file systems provide strong consistency semantics, which many
modern HPC applications do not need and do not want. Object store technologies avoid …

Joins on high-bandwidth memory: a new level in the memory hierarchy

C Pohl, KU Sattler, G Graefe - The VLDB Journal, 2020 - Springer
High-bandwidth memory (HBM) gives an additional opportunity for hardware performance
benefits. The high available bandwidth compared to regular DRAM allows execution of …