Fast implementation of DGEMM on Fermi GPU

P Hijma, S Heldens, A Sclocco… - ACM Computing …, 2023 - dl.acm.org

In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …

Cited by 67 · Related articles · All 3 versions

cuDNN: Efficient primitives for deep learning

S Chetlur, C Woolley, P Vandermersch… - arXiv preprint arXiv …, 2014 - arxiv.org

We present a library of efficient implementations of deep learning primitives. Deep learning
workloads are computationally intensive, and optimizing their kernels is difficult and time …

Cited by 2401 · Related articles · All 9 versions

Dissecting the NVIDIA Volta GPU architecture via microbenchmarking

Z Jia, M Maggioni, B Staiger, DP Scarpazza - arXiv preprint arXiv …, 2018 - arxiv.org

Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and
technological progression, coupled with a reluctance by manufacturers to disclose low-level …

Cited by 386 · Related articles · All 4 versions

Digital electronics and analog photonics for convolutional neural networks (DEAP-CNNs)

V Bangari, BA Marquez, H Miller, AN Tait… - IEEE Journal of …, 2019 - ieeexplore.ieee.org

Convolutional Neural Networks (CNNs) are powerful and highly ubiquitous tools for
extracting features from large datasets for applications such as computer vision and natural …

Cited by 231 · Related articles · All 8 versions

Revisiting co-processing for hash joins on the coupled CPU-GPU architecture

J He, M Lu, B He - arXiv preprint arXiv:1307.1955, 2013 - arxiv.org

Query co-processing on graphics processors (GPUs) has become an effective means to
improve the performance of main memory databases. However, the relatively low bandwidth …

Cited by 181 · Related articles · All 13 versions

Performance, design, and autotuning of batched GEMM for GPUs

A Abdelfattah, A Haidar, S Tomov… - … Conference, ISC High …, 2016 - Springer

The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in
dense linear algebra, and is the key component for obtaining high performance in most …

Cited by 146 · Related articles · All 10 versions

Autotuning GEMM kernels for the Fermi GPU

J Kurzak, S Tomov, J Dongarra - IEEE Transactions on Parallel …, 2012 - ieeexplore.ieee.org

In recent years, the use of graphics chips has been recognized as a viable way of
accelerating scientific and engineering applications, even more so since the introduction of …

Cited by 162 · Related articles · All 5 versions

Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs

J Lai, A Seznec - Proceedings of the 2013 IEEE/ACM …, 2013 - ieeexplore.ieee.org

In this paper, we present an approach to estimate GPU applications' performance upper
bound based on algorithm analysis and assembly code level benchmarking. As an example …

Cited by 127 · Related articles · All 15 versions

A coordinated tiling and batching framework for efficient GEMM on GPUs

X Li, Y Liang, S Yan, L Jia, Y Li - Proceedings of the 24th symposium on …, 2019 - dl.acm.org

General matrix multiplication (GEMM) plays a paramount role in a broad range of domains
such as deep learning, scientific computing, and image processing. The primary …

Cited by 68 · Related articles · All 4 versions

gpucc: an open-source GPGPU compiler

J Wu, A Belevich, E Bendersky, M Heffernan… - Proceedings of the …, 2016 - dl.acm.org

Graphics Processing Units have emerged as powerful accelerators for massively parallel,
numerically intensive workloads. The two dominant software models for these devices are …

Cited by 90 · Related articles · All 18 versions
