Performance-centric register file design for GPUs using racetrack memory

S Wang, Y Liang, C Zhang, X Xie, G Sun, Y Liu, Y Wang, X Li
2016 21st Asia and South Pacific Design Automation Conference (ASP …, 2016 - ieeexplore.ieee.org
The key to high performance in GPU architectures lies in massive threading, which drives the large number of cores and enables overlapping of thread execution. In practice, however, the number of threads that can execute simultaneously is often limited by the size of the GPU register file. The traditional SRAM-based register file consumes so much chip area that it cannot scale to meet the increasing demand for massive threading in GPU applications. Racetrack memory is a promising technology for designing large-capacity register files on GPUs because of its high storage density. However, without careful deployment of registers, the lengthy shift operations of racetrack memory may hurt performance. In this paper, we explore racetrack memory for designing a high-performance register file for GPU architectures. The high storage density of racetrack memory helps to improve thread-level parallelism, i.e., the number of threads that execute simultaneously. However, if the bits of a register are not aligned to the ports, shift operations are required to move the bits to the ports. To mitigate this shift overhead, we develop a register file preshifting strategy and a compile-time managed register mapping algorithm. Experimental results demonstrate that our technique achieves up to 24% (19% on average) improvement in performance for a variety of GPU applications.
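To make the shift overhead concrete, the following is a minimal sketch of a toy cost model, not the mapping algorithm from the paper. It assumes a single racetrack track with one access port, where reaching a register costs as many shift steps as its distance from the currently aligned position. The function names, the register names, and the "first use" placement heuristic are all hypothetical and chosen only for illustration.

```python
def total_shifts(access_sequence, placement, start_position=0):
    """Count shift steps on a toy single-port track.

    placement maps a register name to its position on the track.
    Accessing position p while position q is aligned costs |p - q| shifts.
    """
    aligned = start_position
    shifts = 0
    for reg in access_sequence:
        pos = placement[reg]
        shifts += abs(pos - aligned)
        aligned = pos  # the track stays where the last access left it
    return shifts


def first_use_placement(access_sequence):
    """Hypothetical heuristic: place registers in order of first use."""
    placement = {}
    for reg in access_sequence:
        if reg not in placement:
            placement[reg] = len(placement)
    return placement


# Assumed access trace for one thread (illustrative only).
accesses = ["r0", "r3", "r0", "r3", "r1", "r2", "r1", "r2"]

# Placement by register number versus placement by first use.
by_number = {"r0": 0, "r1": 1, "r2": 2, "r3": 3}
by_first_use = first_use_placement(accesses)

print(total_shifts(accesses, by_number))     # 14 shift steps
print(total_shifts(accesses, by_first_use))  # 7 shift steps
```

In this toy trace, placing registers that are accessed together next to each other halves the number of shift steps, which is the intuition behind managing the register mapping at compile time. A preshifting strategy would additionally try to perform some of these shifts before the access is issued, which this sketch does not model.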
ieeexplore.ieee.org