The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs

N Vijaykumar, E Ebrahimi, K Hsieh… - 2018 ACM/IEEE 45th …, 2018 - ieeexplore.ieee.org
Exploiting data locality in GPUs is critical to making more efficient use of the existing caches
and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU …

Access pattern-aware cache management for improving data utilization in GPU

G Koo, Y Oh, WW Ro, M Annavaram - Proceedings of the 44th annual …, 2017 - dl.acm.org
Long memory-operation latency is a prominent performance bottleneck in graphics
processing units (GPUs). The small data cache that must be shared across dozens of warps …

PAVER: Locality graph-based thread block scheduling for GPUs

D Tripathy, A Abdolrashidi, LN Bhuyan, L Zhou… - ACM Transactions on …, 2021 - dl.acm.org
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache
sizes per thread, leading to serious cache contention problems such as thrashing. Hence …

Cross-core Data Sharing for Energy-efficient GPUs

H Falahati, M Sadrosadati, Q Xu… - ACM Transactions on …, 2024 - dl.acm.org
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application
domains, because they can accelerate massively parallel workloads and can be easily …

Analyzing and leveraging remote-core bandwidth for enhanced performance in GPUs

MA Ibrahim, H Liu, O Kayiran… - 2019 28th International …, 2019 - ieeexplore.ieee.org
Bandwidth achieved from local/shared caches and memory is a major performance
determinant in Graphics Processing Units (GPUs). These existing sources of bandwidth are …

Analyzing and leveraging shared L1 caches in GPUs

MA Ibrahim, O Kayiran, Y Eckert, GH Loh… - Proceedings of the ACM …, 2020 - dl.acm.org
Graphics Processing Units (GPUs) concurrently execute thousands of threads, which makes
them effective for achieving high throughput for a wide range of applications. However, the …

G-TSC: Timestamp based coherence for GPUs

A Tabbakh, X Qian… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
Cache coherence has been studied extensively in the context of chip multiprocessors
(CMP). It is well known that conventional directory-based and snooping coherence protocols …

CTA-aware prefetching and scheduling for GPU

G Koo, H Jeon, Z Liu, NS Kim… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
Although GPUs are designed to tolerate the long latency of data-fetch operations, we observe
that L1 cache misses occur in a bursty manner for many memory-intensive applications. This …

Linebacker: Preserving victim cache lines in idle register files of GPUs

Y Oh, G Koo, M Annavaram, WW Ro - Proceedings of the 46th …, 2019 - dl.acm.org
Modern GPUs suffer from cache contention due to the limited cache size that is shared
across tens of concurrently running warps. To increase the per-warp cache size prior …

Analyzing GCN aggregation on GPU

I Kim, J Jeong, Y Oh, MK Yoon, G Koo - IEEE Access, 2022 - ieeexplore.ieee.org
Graph convolutional neural networks (GCNs) are emerging neural networks for graph
structures that include large features associated with each vertex. The operations of GCN …