Extendable pattern-oriented optimization directives

S Grauer-Gray, L Xu, R Searles… - 2012 innovative …, 2012 - ieeexplore.ieee.org

Determining the best set of optimizations to apply to a kernel to be executed on the graphics
processing unit (GPU) is a challenging problem. There are large sets of possible …

Cited by 543 · Related articles · All 16 versions

[PDF] github.io

AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

Q Wang, X Zhang, Y Zhang, Q Yi - Proceedings of the international …, 2013 - dl.acm.org

Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In
this paper, we present a template-based optimization framework, AUGEM, which can …

Cited by 290 · Related articles · All 10 versions

Model-driven level 3 BLAS performance optimization on Loongson 3A processor

Z Xianyi, W Qian, Z Yunquan - 2012 IEEE 18th international …, 2012 - ieeexplore.ieee.org

Every mainstream processor vendor provides an optimized BLAS implementation for its
CPU, as BLAS is a fundamental math library in scientific computing. The Loongson 3A CPU …

Cited by 266 · Related articles · All 4 versions

[PDF] carch.ac.cn

Predicting cross-core performance interference on multicore processors with regression analysis

J Zhao, H Cui, J Xue, X Feng - IEEE Transactions on Parallel …, 2015 - ieeexplore.ieee.org

Despite their widespread adoption in cloud computing, multicore processors are heavily
under-utilized in terms of computing resources. To avoid the potential for negative and …

Cited by 36 · Related articles · All 7 versions

[PDF] psu.edu

An empirical model for predicting cross-core performance interference on multicore processors

J Zhao, X Feng, H Cui, Y Yan, J Xue… - Proceedings of the …, 2013 - ieeexplore.ieee.org

Despite their widespread adoption in cloud computing, multicore processors are heavily
under-utilized in terms of computing resources. To avoid the potential for negative and …

Cited by 51 · Related articles · All 9 versions

[PDF] unsw.edu.au

Automatic generation of fast BLAS3-GEMM: A portable compiler approach

X Su, X Liao, J Xue - 2017 IEEE/ACM International Symposium …, 2017 - ieeexplore.ieee.org

GEMM is the main computational kernel in BLAS3. Its micro-kernel is either hand-crafted in
assembly code or generated from C code by general-purpose compilers (guided by …

Cited by 19 · Related articles · All 4 versions

[PDF] github.io

Bandwidth-aware loop tiling for DMA-supported scratchpad memory

M Wu, Y Liu, H Cui, Q Wei, Q Li, L Li, F Lv… - Proceedings of the …, 2020 - dl.acm.org

Scratchpad Memory (SPM) is widely used in emerging domain-specific architectures and
accelerators for improving energy efficiency and time predictability. Typically, SPM-based …

Cited by 10 · Related articles · All 2 versions

[PDF] github.io

Automatic library generation for BLAS3 on GPUs

H Cui, L Wang, J Xue, Y Yang… - 2011 IEEE International …, 2011 - ieeexplore.ieee.org

High-performance libraries, the performance-critical building blocks for high-level
applications, will assume greater importance on modern processors as they become more …

Cited by 42 · Related articles · All 11 versions

[PDF] ict.ac.cn

Yet another intelligent code-generating system: A flexible and low-cost solution

JF Filho, LGA Rodriguez, AF da Silva - Journal of Computer Science and …, 2018 - Springer

Modern compilers apply various code transformation algorithms to improve the quality of the
target code. However, a complex problem is to determine which transformation algorithms …

Cited by 13 · Related articles · All 6 versions

[PDF] researchgate.net

Godson-T: An efficient many-core processor exploring thread-level parallelism

D Fan, H Zhang, D Wang, X Ye, F Song, G Li… - IEEE Micro, 2012 - ieeexplore.ieee.org

Godson-T is a research many-core processor designed for parallel scientific computing that
delivers efficient performance and flexible programmability simultaneously. It also has many …

Cited by 31 · Related articles · All 8 versions
