Optimization techniques for GPU programming

P Hijma, S Heldens, A Sclocco… - ACM Computing …, 2023 - dl.acm.org
In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …

cuDNN: Efficient primitives for deep learning

S Chetlur, C Woolley, P Vandermersch… - arXiv preprint arXiv …, 2014 - arxiv.org
We present a library of efficient implementations of deep learning primitives. Deep learning
workloads are computationally intensive, and optimizing their kernels is difficult and time …

Dissecting the NVIDIA Volta GPU architecture via microbenchmarking

Z Jia, M Maggioni, B Staiger, DP Scarpazza - arXiv preprint arXiv …, 2018 - arxiv.org
Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and
technological progression, coupled with a reluctance by manufacturers to disclose low-level …

Digital electronics and analog photonics for convolutional neural networks (DEAP-CNNs)

V Bangari, BA Marquez, H Miller, AN Tait… - IEEE Journal of …, 2019 - ieeexplore.ieee.org
Convolutional Neural Networks (CNNs) are powerful and highly ubiquitous tools for
extracting features from large datasets for applications such as computer vision and natural …

Revisiting co-processing for hash joins on the coupled CPU-GPU architecture

J He, M Lu, B He - arXiv preprint arXiv:1307.1955, 2013 - arxiv.org
Query co-processing on graphics processors (GPUs) has become an effective means to
improve the performance of main memory databases. However, the relatively low bandwidth …

Performance, design, and autotuning of batched GEMM for GPUs

A Abdelfattah, A Haidar, S Tomov… - … Conference, ISC High …, 2016 - Springer
The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in
dense linear algebra, and is the key component for obtaining high performance in most …

Autotuning GEMM kernels for the Fermi GPU

J Kurzak, S Tomov, J Dongarra - IEEE Transactions on Parallel …, 2012 - ieeexplore.ieee.org
In recent years, the use of graphics chips has been recognized as a viable way of
accelerating scientific and engineering applications, even more so since the introduction of …

Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs

J Lai, A Seznec - Proceedings of the 2013 IEEE/ACM …, 2013 - ieeexplore.ieee.org
In this paper, we present an approach to estimate GPU applications' performance upper
bound based on algorithm analysis and assembly code level benchmarking. As an example …

A coordinated tiling and batching framework for efficient GEMM on GPUs

X Li, Y Liang, S Yan, L Jia, Y Li - Proceedings of the 24th symposium on …, 2019 - dl.acm.org
General matrix multiplication (GEMM) plays a paramount role in a broad range of domains
such as deep learning, scientific computing, and image processing. The primary …

gpucc: an open-source GPGPU compiler

J Wu, A Belevich, E Bendersky, M Heffernan… - Proceedings of the …, 2016 - dl.acm.org
Graphics Processing Units have emerged as powerful accelerators for massively parallel,
numerically intensive workloads. The two dominant software models for these devices are …