Optimization techniques for GPU programming

P Hijma, S Heldens, A Sclocco… - ACM Computing …, 2023 - dl.acm.org
In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …

FlashAttention-3: Fast and accurate attention with asynchrony and low-precision

J Shah, G Bikshandi, Y Zhang, V Thakkar… - arXiv preprint arXiv …, 2024 - arxiv.org
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for
large language models and long-context applications. FlashAttention elaborated an …

Review of data science trends and issues in porous media research with a focus on image‐based techniques

A Rabbani, AM Fernando, R Shams… - Water Resources …, 2021 - Wiley Online Library
Data science as a flourishing interdisciplinary domain of computer and mathematical
sciences is playing an important role in guiding the porous material research streams. In the …

GPUWattch: Enabling energy optimizations in GPGPUs

J Leng, T Hetherington, A ElTantawy, S Gilani… - ACM SIGARCH …, 2013 - dl.acm.org
General-purpose GPUs (GPGPUs) are becoming prevalent in mainstream computing, and
performance per watt has emerged as a more crucial evaluation metric than peak …

OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

A Jog, O Kayiran, N Chidambaram Nachiappan… - ACM SIGPLAN …, 2013 - dl.acm.org
Emerging GPGPU architectures, along with programming models like CUDA and OpenCL,
offer a cost-effective platform for many applications by providing high thread level …

LightLDA: Big topic models on modest computer clusters

J Yuan, F Gao, Q Ho, W Dai, J Wei, X Zheng… - Proceedings of the 24th …, 2015 - dl.acm.org
When building large-scale machine learning (ML) programs, such as massive topic models
or deep neural networks with up to trillions of parameters and training examples, one usually …

Hardware acceleration of database operations

J Casper, K Olukotun - Proceedings of the 2014 ACM/SIGDA …, 2014 - dl.acm.org
As the amount of memory in database systems grows, entire database tables, or even
databases, are able to fit in the system's memory, making in-memory database operations …

Orchestrated scheduling and prefetching for GPGPUs

A Jog, O Kayiran, AK Mishra, MT Kandemir… - Proceedings of the 40th …, 2013 - dl.acm.org
In this paper, we present techniques that coordinate the thread scheduling and prefetching
decisions in a General Purpose Graphics Processing Unit (GPGPU) architecture to better …

A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps

N Vijaykumar, G Pekhimenko, A Jog… - ACM SIGARCH …, 2015 - dl.acm.org
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent
execution of thousands of threads. Unfortunately, different bottlenecks during execution and …

Scalable kernel fusion for memory-bound GPU applications

M Wahib, N Maruyama - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org
GPU implementations of HPC applications relying on finite difference methods can include
tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing …