Warped-preexecution: A GPU pre-execution approach for improving latency hiding

K Kim, S Lee, MK Yoon, G Koo, WW Ro… - … Symposium on High …, 2016 - ieeexplore.ieee.org
This paper presents a pre-execution approach for improving GPU performance, called P-mode
(pre-execution mode). GPUs utilize a number of concurrent threads for hiding …

APRES: Improving cache efficiency by exploiting load characteristics on GPUs

Y Oh, K Kim, MK Yoon, JH Park, Y Park… - ACM SIGARCH …, 2016 - dl.acm.org
Long memory latency and limited throughput become performance bottlenecks of GPGPU
applications. The latency takes hundreds of cycles, which is difficult to hide by simply …

A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity

M Khairy, AG Wassal, M Zahran - Journal of Parallel and Distributed …, 2019 - Elsevier
With the skyrocketing advances of process technology, the increased need to process huge
amounts of data, and the pivotal need for power efficiency, the usage of Graphics Processing …

FineReg: Fine-grained register file management for augmenting GPU throughput

Y Oh, MK Yoon, WJ Song… - 2018 51st Annual IEEE …, 2018 - ieeexplore.ieee.org
Graphics processing units (GPUs) include a large amount of hardware resources for parallel
thread execution. However, the resources are not fully utilized during runtime, and …

Data optimization: Minimizing residual interprocessor data motion on simd machines

K Knobe, V Natarajan - Third Symposium on the Frontiers of …, 1990 - computer.org

Dynamic resizing on active warps scheduler to hide operation stalls on GPUs

MK Yoon, Y Oh, SH Kim, S Lee, D Kim… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
This paper conducts a detailed study of the factors affecting the operation stalls in terms of
the fetch group size on the warp scheduler of GPUs. Throughout this paper, we reveal that …

Locality and scheduling in the massively multithreaded era

TG Rogers - 2015 - open.library.ubc.ca
Massively parallel processing devices, like Graphics Processing Units (GPUs), have the
ability to accelerate highly parallel workloads in an energy-efficient manner. However …

Latency hiding based warp scheduling policy for high performance GPUs

GB Kim, JM Kim, CH Kim - Journal of The Korea Society of …, 2019 - koreascience.kr
LRR (Loose Round Robin) warp scheduling policy for GPU architecture results in
high warp-level parallelism and balanced loads across multiple warps. However, traditional …

A distributed architecture and design challenges of an astray pilgrim tracking system

MAR Abdeen - 2018 IEEE 16th Intl Conf on Dependable …, 2018 - ieeexplore.ieee.org
In this paper we present a distributed architecture to address the problems of managing,
tracking, and predicting astray persons in large crowds. Our case study considers the …

Methods and apparatus for intra-wave texture looping

AE Gruber - US Patent 11,640,647, 2023 - Google Patents
US11640647B2 - Methods and apparatus for intra-wave texture looping - Google Patents