Evaluating Performance and Portability of a core bioinformatics kernel on multiple vendor GPUs

M Haseeb, N Ding, J Deslippe… - … and productivity in HPC …, 2021 - ieeexplore.ieee.org
Traditional scientific simulations have, for quite some time, dominated the workloads of high-performance
computing infrastructures across the world. With recent advancement in data …

Dynamic stashing quantization for efficient transformer training

G Yang, D Lo, R Mullins, Y Zhao - arXiv preprint arXiv:2303.05295, 2023 - arxiv.org
Large Language Models (LLMs) have demonstrated impressive performance on a range of
Natural Language Processing (NLP) tasks. Unfortunately, the immense amount of …

A comparison of two performance portability metrics

A Marowka - Concurrency and Computation: Practice and …, 2023 - Wiley Online Library
The rise in the demand for new performance portability frameworks for heterogeneous
computing systems has brought with it a number of proposals of workable metrics for …

GPU-acceleration of the distributed-memory database peptide search of mass spectrometry data

M Haseeb, F Saeed - Scientific Reports, 2023 - nature.com
Database peptide search is the primary computational technique for identifying peptides
from the mass spectrometry (MS) data. Graphical Processing Units (GPU) computing is now …

Shifting Between Compute and Memory Bounds: A Compression-Enabled Roofline Model

R Naraparaju, T Zhao, Y Hu, D Zhao… - SC24-W: Workshops …, 2024 - ieeexplore.ieee.org
In the evolving landscape of high-performance computing, especially to fight the end of
Moore's Law and Dennard's Scaling, the ability to shift between compute-bound and …

Starlight: A kernel optimizer for GPU processing

A Zeni, E Del Sozzo, E D'Arnese, D Conficconi… - Journal of Parallel and …, 2024 - Elsevier
Over the past few years, GPUs have found widespread adoption in many scientific domains,
offering notable performance and energy efficiency advantages compared to CPUs …

Arch2End: Two-Stage Unified System-Level Modeling for Heterogeneous Intelligent Devices

W Liu, Z Zhu, B Li, Y Xiong, Z Lian… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The surge in intelligent edge computing has propelled the adoption and expansion of the
distributed embedded systems (DESs). Numerous scheduling strategies are introduced to …

Predicting GPU kernel's performance on upcoming architectures

L Van Lanker, H Taboada, E Brunet… - European Conference on …, 2024 - Springer
With the advent of heterogeneous systems that combine CPUs and GPUs, designing a
supercomputer becomes more and more complex. The hardware characteristics of GPUs …

RAJA Performance Suite: Performance Portability Analysis with Caliper and Thicket

O Pearce, J Burmark, R Hornung… - SC24-W: Workshops …, 2024 - ieeexplore.ieee.org
Maintaining performant code in a world of fast-evolving computer architectures and
programming models poses a significant challenge to scientists. Typically, benchmark codes …

Performance Optimization in Deep Learning Systems

A Erben - 2024 - mediatum.ub.tum.de
This publication-based dissertation comprises two papers that argue for more efficient
resource utilization by challenging the status quo on common DL practices and providing …