TileSpGEMM: A tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs

Y Niu, Z Lu, H Ji, S Song, Z Jin, W Liu - Proceedings of the 27th ACM …, 2022 - dl.acm.org
Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental
building blocks in sparse linear solvers, graph processing frameworks and machine learning …

Paver: Locality graph-based thread block scheduling for GPUs

D Tripathy, A Abdolrashidi, LN Bhuyan, L Zhou… - ACM Transactions on …, 2021 - dl.acm.org
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache
sizes per thread, leading to serious cache contention problems such as thrashing. Hence …

Random walks on huge graphs at cache efficiency

K Yang, X Ma, S Thirumuruganathan, K Chen… - Proceedings of the ACM …, 2021 - dl.acm.org
Data-intensive applications dominated by random accesses to large working sets fail to
utilize the computing power of modern processors. Graph random walk, an indispensable …

Enabling efficient fast convolution algorithms on GPUs via MegaKernels

L Jia, Y Liang, X Li, L Lu, S Yan - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Modern Convolutional Neural Networks (CNNs) require a massive amount of convolution
operations. To address the overwhelming computation problem, Winograd and FFT fast …

Compiler-assisted GPU thread throttling for reduced cache contention

H Kim, S Hong, H Lee, E Seo, H Han - Proceedings of the 48th …, 2019 - dl.acm.org
Modern GPUs concurrently deploy thousands of threads to maximize thread level
parallelism (TLP) for performance. For some applications, however, maximized TLP leads to …

GPU thread throttling for page-level thrashing reduction via static analysis

H Kim, H Han - The Journal of Supercomputing, 2024 - Springer
Unified virtual memory was introduced in modern GPUs to enable a new programming
model for programmers. This method manages memory pages between the GPU and CPU …

Device Hopping: Transparent Mid-Kernel Runtime Switching for Heterogeneous Systems

P Metzger, V Seeker, C Fensch, M Cole - ACM Transactions on …, 2021 - dl.acm.org
Existing OS techniques for homogeneous many-core systems make it simple for single and
multithreaded applications to migrate between cores. Heterogeneous systems do not benefit …

EZLDA: Efficient and Scalable LDA on GPUs

S Wang, H Liu, A Gaihre, H Yu - IEEE Access, 2023 - ieeexplore.ieee.org
Latent Dirichlet Allocation (LDA) is a statistical approach for topic modeling with a wide
range of applications. Attracted by the exceptional computing and memory throughput …

GPU Accelerated Latent Dirichlet Allocation

S Wang - 2024 - search.proquest.com
Latent Dirichlet Allocation (LDA) is a statistical approach for topic modeling with a
wide range of applications. LDA can be subdivided into flatten model and hierarchical …

Programmer-transparent efficient parallelism with skeletons

P Metzger - 2021 - era.ed.ac.uk
Parallel and heterogeneous systems are ubiquitous. Unfortunately, both require significant
complexity at the software level to the detriment of programmer productivity. To produce …