Locality-Aware Software Throttling for Sparse Matrix Operation on GPUs

Y Niu, Z Lu, H Ji, S Song, Z Jin, W Liu - Proceedings of the 27th ACM …, 2022 - dl.acm.org

Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental
building blocks in sparse linear solvers, graph processing frameworks and machine learning …

Cited by 50  Related articles  All 4 versions

Paver: Locality graph-based thread block scheduling for gpus

D Tripathy, A Abdolrashidi, LN Bhuyan, L Zhou… - ACM Transactions on …, 2021 - dl.acm.org

The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache
sizes per thread, leading to serious cache contention problems such as thrashing. Hence …

Cited by 34  Related articles  All 6 versions

Random walks on huge graphs at cache efficiency

K Yang, X Ma, S Thirumuruganathan, K Chen… - Proceedings of the ACM …, 2021 - dl.acm.org

Data-intensive applications dominated by random accesses to large working sets fail to
utilize the computing power of modern processors. Graph random walk, an indispensable …

Cited by 21  Related articles  All 3 versions

Enabling efficient fast convolution algorithms on GPUs via MegaKernels

L Jia, Y Liang, X Li, L Lu, S Yan - IEEE Transactions on …, 2020 - ieeexplore.ieee.org

Modern Convolutional Neural Networks (CNNs) require a massive amount of convolution
operations. To address the overwhelming computation problem, Winograd and FFT fast …

Cited by 25  Related articles  All 4 versions

Compiler-assisted GPU thread throttling for reduced cache contention

H Kim, S Hong, H Lee, E Seo, H Han - Proceedings of the 48th …, 2019 - dl.acm.org

Modern GPUs concurrently deploy thousands of threads to maximize thread level
parallelism (TLP) for performance. For some applications, however, maximized TLP leads to …

Cited by 12  Related articles  All 2 versions

GPU thread throttling for page-level thrashing reduction via static analysis

H Kim, H Han - The Journal of Supercomputing, 2024 - Springer

Unified virtual memory was introduced in modern GPUs to enable a new programming
model for programmers. This method manages memory pages between the GPU and CPU …

Device Hopping: Transparent Mid-Kernel Runtime Switching for Heterogeneous Systems

P Metzger, V Seeker, C Fensch, M Cole - ACM Transactions on …, 2021 - dl.acm.org

Existing OS techniques for homogeneous many-core systems make it simple for single and
multithreaded applications to migrate between cores. Heterogeneous systems do not benefit …

Cited by 1  Related articles  All 2 versions

EZLDA: Efficient and Scalable LDA on GPUs

S Wang, H Liu, A Gaihre, H Yu - IEEE Access, 2023 - ieeexplore.ieee.org

Latent Dirichlet Allocation (LDA) is a statistical approach for topic modeling with a wide
range of applications. Attracted by the exceptional computing and memory throughput …

Cited by 2  Related articles  All 5 versions

GPU Accelerated Latent Dirichlet Allocation

S Wang - 2024 - search.proquest.com

Abstract Latent Dirichlet Allocation (LDA) is a statistical approach for topic modeling with a
wide range of applications. LDA can be subdivided into flatten model and hierarchical …

Programmer-transparent efficient parallelism with skeletons

P Metzger - 2021 - era.ed.ac.uk

Parallel and heterogeneous systems are ubiquitous. Unfortunately, both require significant
complexity at the software level to the detriment of programmer productivity. To produce …
