The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs

N Vijaykumar, E Ebrahimi, K Hsieh… - 2018 ACM/IEEE 45th …, 2018 - ieeexplore.ieee.org
Exploiting data locality in GPUs is critical to making more efficient use of the existing caches
and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU …

Access pattern-aware cache management for improving data utilization in GPU

G Koo, Y Oh, WW Ro, M Annavaram - Proceedings of the 44th annual …, 2017 - dl.acm.org
Long memory-operation latency is a prominent performance bottleneck in graphics
processing units (GPUs). The small data cache that must be shared across dozens of warps …

PAVER: Locality graph-based thread block scheduling for GPUs

D Tripathy, A Abdolrashidi, LN Bhuyan, L Zhou… - ACM Transactions on …, 2021 - dl.acm.org
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache
sizes per thread, leading to serious cache contention problems such as thrashing. Hence …

Cross-core Data Sharing for Energy-efficient GPUs

H Falahati, M Sadrosadati, Q Xu… - ACM Transactions on …, 2024 - dl.acm.org
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application
domains, because they can accelerate massively parallel workloads and can be easily …

Analyzing and leveraging remote-core bandwidth for enhanced performance in GPUs

MA Ibrahim, H Liu, O Kayiran… - 2019 28th International …, 2019 - ieeexplore.ieee.org
Bandwidth achieved from local/shared caches and memory is a major performance
determinant in Graphics Processing Units (GPUs). These existing sources of bandwidth are …

Analyzing and leveraging shared L1 caches in GPUs

MA Ibrahim, O Kayiran, Y Eckert, GH Loh… - Proceedings of the ACM …, 2020 - dl.acm.org
Graphics Processing Units (GPUs) concurrently execute thousands of threads, which makes
them effective for achieving high throughput for a wide range of applications. However, the …

G-TSC: Timestamp based coherence for GPUs

A Tabbakh, X Qian… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
Cache coherence has been studied extensively in the context of chip multiprocessors
(CMP). It is well known that conventional directory-based and snooping coherence protocols …

CTA-aware prefetching and scheduling for GPU

G Koo, H Jeon, Z Liu, NS Kim… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
Although GPUs are designed to tolerate the long latency of data-fetch operations, we observe
that L1 cache misses occur in a bursty manner for many memory-intensive applications. This …

Linebacker: Preserving victim cache lines in idle register files of GPUs

Y Oh, G Koo, M Annavaram, WW Ro - Proceedings of the 46th …, 2019 - dl.acm.org
Modern GPUs suffer from cache contention due to the limited cache size that is shared
across tens of concurrently running warps. To increase the per-warp cache size prior …

Analyzing GCN aggregation on GPU

I Kim, J Jeong, Y Oh, MK Yoon, G Koo - IEEE Access, 2022 - ieeexplore.ieee.org
Graph convolutional neural networks (GCNs) are emerging neural networks for graph
structures that include large features associated with each vertex. The operations of GCN …