uiCA: Accurate throughput prediction of basic blocks on recent Intel microarchitectures

A Abel, J Reineke - Proceedings of the 36th ACM International …, 2022 - dl.acm.org
Performance models that statically predict the steady-state throughput of basic blocks on
particular microarchitectures, such as IACA, Ithemal, llvm-mca, OSACA, or CQA, can guide …

BHive: A benchmark suite and measurement framework for validating x86-64 basic block performance models

Y Chen, A Brahmakshatriya, C Mendis… - 2019 IEEE …, 2019 - ieeexplore.ieee.org
Compilers and performance engineers use hardware performance models to simplify
program optimizations. Performance models provide a necessary abstraction over complex …

Facile: Fast, accurate, and interpretable basic-block throughput prediction

A Abel, S Sharma, J Reineke - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
Basic-block throughput models such as uiCA, IACA, GRANITE, Ithemal, llvm-mca, OSACA,
or CQA guide optimizing compilers and help performance engineers identify and eliminate …

Uncovering the performance bottleneck of modern HPC processor with static code analyzer: a case study on Kunpeng 920

S Tan, Q Jiang, Z Cao, X Hao, J Chen, H An - CCF Transactions on High …, 2024 - Springer
The performance of high-performance computing (HPC) and other real-world applications is
becoming unpredictable as the micro-architecture of the modern central processing unit …

Custom High-Performance Vector Code Generation for Data-Specific Sparse Computations

M Horro, LN Pouchet, G Rodríguez… - Proceedings of the …, 2022 - dl.acm.org
Sparse computations, such as sparse matrix-dense vector multiplication, are notoriously
hard to optimize due to their irregularity and memory-boundedness. Solutions to improve the …

Evaluating the effectiveness of a vector-length-agnostic instruction set

A Poenaru, S McIntosh-Smith - Euro-Par 2020: Parallel Processing: 26th …, 2020 - Springer
In this paper we evaluate the efficacy of the Arm Scalable Vector Extension (SVE) instruction
set for HPC workloads using a set of established mini-apps. Exploiting the vector capabilities …

Vectorization cost modeling for NEON, AVX and SVE

A Pohl, B Cosenza, B Juurlink - Performance Evaluation, 2020 - Elsevier
Compiler optimization passes employ cost models to determine if a code transformation will
yield performance improvements. When this assessment is inaccurate, compilers apply …

Towards automated construction of compiler optimizations

TCY Mendis - 2020 - dspace.mit.edu
First, we present goSLP, a framework that uses integer linear programming to find a globally
pairwise-optimal statement packing strategy to achieve superior vectorization performance …

Modern vector architectures for high-performance computing

A Poenaru - 2022 - research-information.bris.ac.uk
Recent generations of general-purpose central processing units (CPUs) for the
high-performance segment have had to adopt new approaches in order to deliver increasing …

Accurate energy and performance prediction for frequency-scaled GPU kernels

K Fan, B Cosenza, B Juurlink - Computation, 2020 - mdpi.com
Energy optimization is an increasingly important aspect of today's high-performance
computing applications. In particular, dynamic voltage and frequency scaling (DVFS) has …