TPU-KNN: K nearest neighbor search at peak FLOP/s

F Chern, B Hechtman, A Davis, R Guo… - Advances in …, 2022 - proceedings.neurips.cc
This paper presents a novel nearest neighbor search algorithm achieving TPU (Google
Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms …

GPU Database Systems Characterization and Optimization

J Cao, R Sen, M Interlandi, J Arulraj, H Kim - Proceedings of the VLDB …, 2023 - dl.acm.org
GPUs offer massive parallelism and high-bandwidth memory access, making them an
attractive option for accelerating data analytics in database systems. However, while modern …

Logan: High-performance GPU-based X-drop long-read alignment

A Zeni, G Guidi, M Ellis, N Ding… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
Pairwise sequence alignment is one of the most computationally intensive kernels in
genomic data analysis, accounting for more than 90% of the runtime for key bioinformatics …

A comprehensive methodology to optimize FPGA designs via the roofline model

M Siracusa, E Del Sozzo, M Rabozzi… - IEEE Transactions …, 2021 - ieeexplore.ieee.org
With reconfigurable fabrics delivering increasing performance over the years, Field-
Programmable Gate Arrays (FPGAs) are becoming an appealing solution for next …

Hybrid, scalable, trace-driven performance modeling of GPGPUs

Y Arafa, AH Badawy, A ElWazir, A Barai… - Proceedings of the …, 2021 - dl.acm.org
In this paper, we present PPT-GPU, a scalable performance prediction toolkit for GPUs. PPT-
GPU achieves scalability through a hybrid high-level modeling approach where some …

Timemory: modular performance analysis for HPC

JR Madsen, MG Awan, H Brunie, J Deslippe… - … Conference, ISC High …, 2020 - Springer
HPC has undergone a significant transition toward heterogeneous architectures. This
transition has introduced several issues in code migration to support multiple frameworks for …

Fast HBM access with FPGAs: Analysis, architectures, and applications

P Holzinger, D Reiser, T Hahn… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
Over the past few decades, the gap between rapidly increasing computational power and
almost stagnating memory bandwidth has steadily worsened. Recently, 3D die-stacking in …

GPU-optimized approaches to molecular docking-based virtual screening in drug discovery: A comparative analysis

E Vitali, F Ficarelli, M Bisson, D Gadioli… - Journal of Parallel and …, 2024 - Elsevier
Finding a novel drug is a very long and complex procedure. Using computer simulations, it is
possible to accelerate the preliminary phases by performing a virtual screening that filters a …

A CAD-based methodology to optimize HLS code via the roofline model

M Siracusa, L Di Tucci, M Rabozzi, S Williams… - Proceedings of the 39th …, 2020 - dl.acm.org
The intrinsic complexity of modern computing systems requires structured methods for
analyzing and optimizing application performance. In this context, the Roofline model …

Hierarchical roofline performance analysis for deep learning applications

C Yang, Y Wang, T Kurth, S Farrell… - … Computing: Proceedings of …, 2021 - Springer
This paper presents a practical methodology for collecting performance data necessary to
conduct hierarchical Roofline analysis on NVIDIA GPUs. It discusses the extension of the …