Auto-tuning a high-level language targeted to GPU codes

S Grauer-Gray, L Xu, R Searles… - 2012 innovative …, 2012 - ieeexplore.ieee.org
Determining the best set of optimizations to apply to a kernel to be executed on the graphics
processing unit (GPU) is a challenging problem. There are large sets of possible …

AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

Q Wang, X Zhang, Y Zhang, Q Yi - Proceedings of the international …, 2013 - dl.acm.org
Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In
this paper, we present a template-based optimization framework, AUGEM, which can …

Model-driven level 3 BLAS performance optimization on Loongson 3A processor

Z Xianyi, W Qian, Z Yunquan - 2012 IEEE 18th international …, 2012 - ieeexplore.ieee.org
Every mainstream processor vendor provides an optimized BLAS implementation for its
CPU, as BLAS is a fundamental math library in scientific computing. The Loongson 3A CPU …

Predicting cross-core performance interference on multicore processors with regression analysis

J Zhao, H Cui, J Xue, X Feng - IEEE Transactions on Parallel …, 2015 - ieeexplore.ieee.org
Despite their widespread adoption in cloud computing, multicore processors are heavily
under-utilized in terms of computing resources. To avoid the potential for negative and …

An empirical model for predicting cross-core performance interference on multicore processors

J Zhao, X Feng, H Cui, Y Yan, J Xue… - Proceedings of the …, 2013 - ieeexplore.ieee.org
Despite their widespread adoption in cloud computing, multicore processors are heavily
under-utilized in terms of computing resources. To avoid the potential for negative and …

Automatic generation of fast BLAS3-GEMM: A portable compiler approach

X Su, X Liao, J Xue - 2017 IEEE/ACM International Symposium …, 2017 - ieeexplore.ieee.org
GEMM is the main computational kernel in BLAS3. Its micro-kernel is either hand-crafted in
assembly code or generated from C code by general-purpose compilers (guided by …

Bandwidth-aware loop tiling for DMA-supported scratchpad memory

M Wu, Y Liu, H Cui, Q Wei, Q Li, L Li, F Lv… - Proceedings of the …, 2020 - dl.acm.org
Scratchpad Memory (SPM) is widely used in emerging domain-specific architectures and
accelerators for improving energy efficiency and time predictability. Typically, SPM-based …

Automatic library generation for BLAS3 on GPUs

H Cui, L Wang, J Xue, Y Yang… - 2011 IEEE International …, 2011 - ieeexplore.ieee.org
High-performance libraries, the performance-critical building blocks for high-level
applications, will assume greater importance on modern processors as they become more …

Yet another intelligent code-generating system: A flexible and low-cost solution

JF Filho, LGA Rodriguez, AF da Silva - Journal of Computer Science and …, 2018 - Springer
Modern compilers apply various code transformation algorithms to improve the quality of the
target code. However, a complex problem is to determine which transformation algorithms …

Godson-T: An efficient many-core processor exploring thread-level parallelism

D Fan, H Zhang, D Wang, X Ye, F Song, G Li… - IEEE Micro, 2012 - ieeexplore.ieee.org
Godson-T is a research many-core processor designed for parallel scientific computing that
delivers efficient performance and flexible programmability simultaneously. It also has many …