Optimizing CUDA code by kernel fusion: application on BLAS

P Hijma, S Heldens, A Sclocco… - ACM Computing …, 2023 - dl.acm.org

In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …

Cited by 65 Related articles All 3 versions

[PDF] acm.org

Graph IRs for impure higher-order languages: making aggressive optimizations affordable with precise effect dependencies

O Bračevac, G Wei, S Jia, S Abeysinghe… - Proceedings of the …, 2023 - dl.acm.org

Graph-based intermediate representations (IRs) are widely used for powerful compiler
optimizations, either interprocedurally in pure functional languages, or intraprocedurally in …

Cited by 18 Related articles All 5 versions

[PDF] archive.org

Scalable kernel fusion for memory-bound GPU applications

M Wahib, N Maruyama - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org

GPU implementations of HPC applications relying on finite difference methods can include
tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing …

Cited by 125 Related articles All 7 versions

[HTML] nih.gov

TorchMD-Net 2.0: Fast Neural Network Potentials for Molecular Simulations

RP Pelaez, G Simeon, R Galvelis… - Journal of Chemical …, 2024 - ACS Publications

Achieving a balance between computational speed, prediction accuracy, and universal
applicability in molecular simulations has been a persistent challenge. This paper presents …

Cited by 18 Related articles All 9 versions

Demystifying BERT: System design implications

S Pati, S Aga, N Jayasena… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org

Transfer learning in natural language processing (NLP) uses increasingly large models that
tackle challenging problems. Consequently, these applications are driving the requirements …

Cited by 29 Related articles All 2 versions

[PDF] arxiv.org

A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit

F Petrovič, D Střelák, J Hozzová, J Ol'ha… - Future Generation …, 2020 - Elsevier

In recent years, the heterogeneity of both commodity and supercomputers hardware has
increased sharply. Accelerators, such as GPUs or Intel Xeon Phi co-processors, are often …

Cited by 51 Related articles All 8 versions

[PDF] arxiv.org

Automatic horizontal fusion for GPU kernels

A Li, B Zheng, G Pekhimenko… - 2022 IEEE/ACM …, 2022 - ieeexplore.ieee.org

We present automatic horizontal fusion, a novel optimization technique that complements
the standard kernel fusion techniques for GPU programs. Unlike the standard fusion, whose …

Cited by 51 Related articles All 10 versions

[PDF] acm.org

When ML Training Cuts Through Congestion: Just-in-Time Gradient Compression via Packet Trimming

X Chen, S Vargaftik, RB Basat - Proceedings of the 23rd ACM Workshop …, 2024 - dl.acm.org

Distributed training of ML models generates significant network traffic when exchanging
gradients and is sensitive to packet drops and retransmission caused by congestion when …

Cited by 3 Related articles All 3 versions

[PDF] a2r-lab.org

A performance analysis of parallel differential dynamic programming on a GPU

B Plancher, S Kuindersma - … Foundations of Robotics XIII: Proceedings of …, 2020 - Springer

Parallelism can be used to significantly increase the throughput of computationally
expensive algorithms. With the widespread adoption of parallel computing platforms such as …

Cited by 50 Related articles All 10 versions

[PDF] hal.science

GPU parallelization strategies for metaheuristics: a survey

M Essaid, L Idoumghar, J Lepagnot… - International Journal of …, 2019 - Taylor & Francis

Metaheuristics have been showing interesting results in solving hard optimization problems.
However, they become limited in terms of effectiveness and runtime for high dimensional …

Cited by 53 Related articles All 6 versions
