PAVER: Locality graph-based thread block scheduling for GPUs

D Tripathy, A Abdolrashidi, LN Bhuyan, L Zhou… - ACM Transactions on …, 2021 - dl.acm.org
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache
sizes per thread, leading to serious cache contention problems such as thrashing. Hence …

BlockMaestro: Enabling programmer-transparent task-based execution in GPU systems

AA Abdolrashidi, HA Esfeden… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
As modern GPU workloads grow in size and complexity, there is an ever-increasing demand
for GPU computational power. Emerging workloads contain hundreds or thousands of GPU …

Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs

MA Shoushtary, JM Arnau, JT Murgadas… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Modern GPUs require an enormous register file (RF) to store the context of thousands of
active threads. It consumes considerable energy and contains multiple large banks to …

GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction

I Chaturvedi, BR Godala, Y Wu, Z Xu… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) use massive multi-threading coupled with static
scheduling to hide instruction latencies. Despite this, memory instructions pose a challenge …

CASH-RF: A compiler-assisted hierarchical register file in GPUs

Y Oh, I Jeong, WW Ro, MK Yoon - IEEE Embedded Systems …, 2022 - ieeexplore.ieee.org
Spin-transfer torque magnetic random-access memory (STT-MRAM) is an emerging
nonvolatile memory technology that has received significant attention due to its higher …

Conflict-aware compiler for hierarchical register file on GPUs

E Jeong, ES Park, G Koo, Y Oh, MK Yoon - Journal of Systems Architecture, 2024 - Elsevier
Modern graphics processing units (GPUs) leverage a high degree of thread-level
parallelism, necessitating large-sized register files for storing numerous thread contexts. To …

TEA-RC: Thread Context-Aware Register Cache for GPUs

I Jeong, Y Oh, WW Ro, MK Yoon - IEEE Access, 2022 - ieeexplore.ieee.org
Graphics processing units (GPUs) achieve high throughput by exploiting a high degree of
thread-level parallelism (TLP). To support such high TLP, GPUs have a large-sized register …

Highly concurrent latency-tolerant register files for GPUs

M Sadrosadati, A Mirhosseini, A Hajiabadi… - ACM Transactions on …, 2021 - dl.acm.org
Graphics Processing Units (GPUs) employ large register files to accommodate all active
threads and accelerate context switching. Unfortunately, register files are a scalability …

Lightweight Register File Caching in Collector Units for GPUs

M Abaie Shoushtary, JM Arnau… - Proceedings of the 15th …, 2023 - dl.acm.org
Modern GPUs benefit from a sizable Register File (RF) to provide fine-grained thread
switching. As the RF is huge and accessed frequently, it consumes a considerable share of …

A Lightweight, Compiler-Assisted Register File Cache for GPGPU

MA Shoushtary, JM Arnau, JT Murgadas… - arXiv preprint arXiv …, 2023 - arxiv.org
Modern GPUs require an enormous register file (RF) to store the context of thousands of
active threads. It consumes considerable energy and contains multiple large banks to …