Scalable kernel fusion for memory-bound GPU applications

M Wahib, N Maruyama - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org
GPU implementations of HPC applications relying on finite difference methods can include
tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing …
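The snippet above is about fusing memory-bound kernels to cut memory traffic. As a minimal illustration only (plain Python, not the paper's fusion framework), fusing two elementwise passes removes the round trip of the intermediate array through memory:

```python
def unfused(a, b):
    # Two separate "kernels": the intermediate list t is written out
    # and read back, an extra round trip through memory on a GPU.
    t = [x + y for x, y in zip(a, b)]   # kernel 1: t = a + b
    return [2.0 * x for x in t]         # kernel 2: out = 2 * t

def fused(a, b):
    # One fused "kernel": the intermediate value never leaves the
    # loop body (registers, on a GPU), halving the memory traffic.
    return [2.0 * (x + y) for x, y in zip(a, b)]
```

For a memory-bound kernel the arithmetic is cheap and the array reads/writes dominate, which is why eliminating the intermediate pays off.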

Optimizing CUDA code by kernel fusion: application on BLAS

J Filipovič, M Madzin, J Fousek, L Matyska - The Journal of …, 2015 - Springer
Contemporary GPUs have significantly higher arithmetic throughput than memory
throughput. Hence, many GPU kernels are memory bound and cannot exploit arithmetic …

Demystifying BERT: System design implications

S Pati, S Aga, N Jayasena… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Transfer learning in natural language processing (NLP) uses increasingly large models that
tackle challenging problems. Consequently, these applications are driving the requirements …

Automatic horizontal fusion for GPU kernels

A Li, B Zheng, G Pekhimenko… - 2022 IEEE/ACM …, 2022 - ieeexplore.ieee.org
We present automatic horizontal fusion, a novel optimization technique that complements
the standard kernel fusion techniques for GPU programs. Unlike the standard fusion, whose …
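Horizontal fusion, per the snippet, merges *independent* kernels rather than producer-consumer pairs. A toy Python sketch (the partitioning scheme and names are illustrative assumptions, not the paper's compiler) routes halves of one launch's "block" range to two unrelated computations:

```python
def kernel_a(x):
    return x * x        # first independent kernel's work

def kernel_b(x):
    return x + 10       # second independent kernel's work

def horizontally_fused(block_ids, data_a, data_b):
    # One "launch" covers both kernels: low block ids do kernel_a's
    # work, high block ids do kernel_b's, so on a real GPU the two
    # kernels' memory stalls and compute can overlap in one grid.
    n = len(data_a)
    out_a, out_b = [None] * n, [None] * len(data_b)
    for bid in block_ids:
        if bid < n:
            out_a[bid] = kernel_a(data_a[bid])
        else:
            out_b[bid - n] = kernel_b(data_b[bid - n])
    return out_a, out_b
```

The point of the branch on the block id is that neither kernel alone fills the device; interleaving them in one launch improves utilization without any data dependence between the two.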

Computation vs. communication scaling for future transformers on future hardware

S Pati, S Aga, M Islam, N Jayasena… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling neural network models has delivered dramatic quality gains across ML problems.
However, this scaling has increased the reliance on efficient distributed training techniques …

CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization

P Dalmia, RS Kumar, MD Sinclair - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Chiplets are transforming computer system designs, allowing system designers to combine
heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic …

Demystifying BERT: Implications for accelerator design

S Pati, S Aga, N Jayasena, MD Sinclair - arXiv preprint arXiv:2104.08335, 2021 - arxiv.org
Transfer learning in natural language processing (NLP), as realized using models like BERT
(Bidirectional Encoder Representations from Transformers), has significantly improved …

gSampler: General and efficient GPU-based graph sampling for graph learning

P Gong, R Liu, Z Mao, Z Cai, X Yan, C Li… - Proceedings of the 29th …, 2023 - dl.acm.org
Graph sampling prepares training samples for graph learning and can dominate the training
time. Due to the increasing algorithm diversity and complexity, existing sampling frameworks …

Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware

S Pati, S Aga, M Islam, N Jayasena… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Scaling neural network models has delivered dramatic quality gains across ML problems.
However, this scaling also increased the reliance on efficient distributed training techniques …

Global Optimizations & Lightweight Dynamic Logic for Concurrency

S Pati, S Aga, N Jayasena, MD Sinclair - arXiv preprint arXiv:2409.02227, 2024 - arxiv.org
Modern accelerators like GPUs are increasingly executing independent operations
concurrently to improve the device's compute utilization. However, effectively harnessing it …