Locality-aware CTA clustering for modern GPUs

A Li, SL Song, W Liu, X Liu, A Kumar… - ACM SIGARCH …, 2017 - dl.acm.org
Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern
GPUs is often awkward. The locality among global memory requests from different SMs …

CUDAAdvisor: LLVM-based runtime profiling for modern GPUs

D Shen, SL Song, A Li, X Liu - … of the 2018 International Symposium on …, 2018 - dl.acm.org
General-purpose GPUs have been widely utilized to accelerate parallel applications. Given
a relatively complex programming model and fast architecture evolution, producing efficient …

Cooperative caching for GPUs

S Dublish, V Nagarajan, N Topham - ACM Transactions on Architecture …, 2016 - dl.acm.org
The rise of general-purpose computing on GPUs has influenced architectural innovation on
them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss …

The implications of page size management on graph analytics

A Manocha, Z Yan, E Tureci, JL Aragón… - 2022 IEEE …, 2022 - ieeexplore.ieee.org
Graph representations of data are ubiquitous in analytic applications. However, graph
workloads are notorious for having irregular memory access patterns with variable access …

FineReg: Fine-grained register file management for augmenting GPU throughput

Y Oh, MK Yoon, WJ Song… - 2018 51st Annual IEEE …, 2018 - ieeexplore.ieee.org
Graphics processing units (GPUs) include a large number of hardware resources for parallel
thread execution. However, these resources are not fully utilized during runtime, and …

GraphFire: Synergizing fetch, insertion, and replacement policies for graph analytics

A Manocha, JL Aragón… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Despite their ubiquity in many important big-data applications, graph analytic kernels
continue to challenge modern memory hierarchies due to their frequent, long-latency …

Linebacker: Preserving victim cache lines in idle register files of GPUs

Y Oh, G Koo, M Annavaram, WW Ro - Proceedings of the 46th …, 2019 - dl.acm.org
Modern GPUs suffer from cache contention due to the limited cache size that is shared
across tens of concurrently running warps. To increase the per-warp cache size prior …

Scrabble: A fine-grained cache with adaptive merged block

C Zhang, Y Zeng, X Guo - IEEE Transactions on Computers, 2019 - ieeexplore.ieee.org
A large fraction of the microprocessor energy is consumed by the data movement in the
system. One of the reasons is the inefficiency in the conventional cache design. Cache …

Improving Data Movement Efficiency in the Memory Systems for Irregular Applications

C Zhang - 2021 - search.proquest.com
Modern processors have a large processor-memory frequency gap, which urges computer
designers to address the inefficiency of the memory system …

Efficient GPU-based query processing with pruned list caching in search engines

D Wang, W Yu, RJ Stones, J Ren… - 2017 IEEE 23rd …, 2017 - ieeexplore.ieee.org
There are two inherent obstacles to effectively using Graphics Processing Units (GPUs) for
query processing in search engines: (a) the highly restricted GPU memory space, and (b) the …