Scalable kernel fusion for memory-bound GPU applications

M Wahib, N Maruyama - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org
GPU implementations of HPC applications relying on finite difference methods can include
tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing …
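The snippet above is about fusing memory-bound kernels to cut memory traffic. As a minimal illustration only (plain Python, not the paper's fusion framework), fusing two elementwise passes removes the round trip of the intermediate array through memory:

```python
def unfused(a, b):
    # Two separate "kernels": the intermediate list t is written out
    # and read back, an extra round trip through memory on a GPU.
    t = [x + y for x, y in zip(a, b)]   # kernel 1: t = a + b
    return [2.0 * x for x in t]         # kernel 2: out = 2 * t

def fused(a, b):
    # One fused "kernel": the intermediate value never leaves the
    # loop body (registers, on a GPU), halving the memory traffic.
    return [2.0 * (x + y) for x, y in zip(a, b)]
```

For a memory-bound kernel the arithmetic is cheap and the array reads/writes dominate, which is why eliminating the intermediate pays off.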

Optimizing CUDA code by kernel fusion: application on BLAS

J Filipovič, M Madzin, J Fousek, L Matyska - The Journal of …, 2015 - Springer
Contemporary GPUs have significantly higher arithmetic throughput than memory
throughput. Hence, many GPU kernels are memory bound and cannot exploit arithmetic …

Demystifying BERT: System design implications

S Pati, S Aga, N Jayasena… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Transfer learning in natural language processing (NLP) uses increasingly large models that
tackle challenging problems. Consequently, these applications are driving the requirements …

Automatic horizontal fusion for GPU kernels

A Li, B Zheng, G Pekhimenko… - 2022 IEEE/ACM …, 2022 - ieeexplore.ieee.org
We present automatic horizontal fusion, a novel optimization technique that complements
the standard kernel fusion techniques for GPU programs. Unlike the standard fusion, whose …
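Horizontal fusion, per the snippet, merges *independent* kernels rather than producer-consumer pairs. A toy Python sketch (the partitioning scheme and names are illustrative assumptions, not the paper's compiler) routes halves of one launch's "block" range to two unrelated computations:

```python
def kernel_a(x):
    return x * x        # first independent kernel's work

def kernel_b(x):
    return x + 10       # second independent kernel's work

def horizontally_fused(block_ids, data_a, data_b):
    # One "launch" covers both kernels: low block ids do kernel_a's
    # work, high block ids do kernel_b's, so on a real GPU the two
    # kernels' memory stalls and compute can overlap in one grid.
    n = len(data_a)
    out_a, out_b = [None] * n, [None] * len(data_b)
    for bid in block_ids:
        if bid < n:
            out_a[bid] = kernel_a(data_a[bid])
        else:
            out_b[bid - n] = kernel_b(data_b[bid - n])
    return out_a, out_b
```

The point of the branch on the block id is that neither kernel alone fills the device; interleaving them in one launch improves utilization without any data dependence between the two.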

Computation vs. communication scaling for future transformers on future hardware

S Pati, S Aga, M Islam, N Jayasena… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling neural network models has delivered dramatic quality gains across ML problems.
However, this scaling has increased the reliance on efficient distributed training techniques …

CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization

P Dalmia, RS Kumar, MD Sinclair - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Chiplets are transforming computer system designs, allowing system designers to combine
heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic …

Demystifying BERT: Implications for accelerator design

S Pati, S Aga, N Jayasena, MD Sinclair - arXiv preprint arXiv:2104.08335, 2021 - arxiv.org
Transfer learning in natural language processing (NLP), as realized using models like BERT
(Bidirectional Encoder Representations from Transformers), has significantly improved …

gSampler: General and efficient GPU-based graph sampling for graph learning

P Gong, R Liu, Z Mao, Z Cai, X Yan, C Li… - Proceedings of the 29th …, 2023 - dl.acm.org
Graph sampling prepares training samples for graph learning and can dominate the training
time. Due to the increasing algorithm diversity and complexity, existing sampling frameworks …

Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware

S Pati, S Aga, M Islam, N Jayasena… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Scaling neural network models has delivered dramatic quality gains across ML problems.
However, this scaling also increased the reliance on efficient distributed training techniques …

Global Optimizations & Lightweight Dynamic Logic for Concurrency

S Pati, S Aga, N Jayasena, MD Sinclair - arXiv preprint arXiv:2409.02227, 2024 - arxiv.org
Modern accelerators like GPUs are increasingly executing independent operations
concurrently to improve the device's compute utilization. However, effectively harnessing it …