Efficient Non-fused Winograd on GPUs

H Wei, E Liu, Y Zhao, H Yu - … , CGI 2020, Geneva, Switzerland, October 20 …, 2020 - Springer
H Wei, E Liu, Y Zhao, H Yu
Advances in Computer Graphics: 37th Computer Graphics International Conference …, 2020, Springer
Abstract
This paper presents an optimized implementation of Winograd non-fused convolution. Our optimizations comprise application-independent grouped producer-consumer chains and a set of Winograd-specific software techniques: a specialized interface-kernels data format, which enhances memory access efficiency; warp specialization and double-buffer prefetching, which effectively exploit computational resources and memory bandwidth; and use of the “shuffle” instruction, which conserves hardware resources. The paper also provides a supplementary explanation of Winograd's tile extraction, which saves memory and computing resources.
The proposed techniques have been evaluated head to head at the kernel level on a GTX 980 GPU with CUDA 9.2, over a wide range of parameters that match CNN layer benchmarks. Compared with the state-of-the-art Winograd non-fused convolution in cuDNN 7.6.4 (released in September 2019), our implementation achieves a total speedup of 1.64x.
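
For readers unfamiliar with the term, the non-fused structure the abstract refers to can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state tile sizes, so the 1D F(2,3) variant is assumed here, plain Python stands in for GPU kernels, and all function names are hypothetical. Each stage runs over every tile and writes an intermediate buffer before the next stage starts, which is what "non-fused" means in this context.

```python
# Sketch of 1D Winograd F(2,3): 2 outputs per 4-element input tile, 3-tap filter.
# Assumption: F(2,3) is chosen for illustration; the abstract names no tile size.

def transform_filter(g):
    """Filter transform (G g) for F(2,3)."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]

def transform_input(d):
    """Input transform (B^T d) for one 4-element tile."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]

def transform_output(m):
    """Output transform (A^T m): 4 products -> 2 outputs."""
    m0, m1, m2, m3 = m
    return [m0 + m1 + m2, m1 - m2 - m3]

def extract_tiles(signal):
    """Overlapping 4-element tiles with stride 2 (adjacent tiles share 2 samples)."""
    n_out = len(signal) - 2
    if n_out <= 0 or n_out % 2:
        raise ValueError("this sketch needs an even, positive output length")
    return [signal[i:i + 4] for i in range(0, n_out, 2)]

def winograd_nonfused_1d(signal, g):
    """Run the stages one after another, each over all tiles (non-fused)."""
    u = transform_filter(g)                                      # stage 0: filter
    v = [transform_input(t) for t in extract_tiles(signal)]      # stage 1: inputs
    m = [[ui * vi for ui, vi in zip(u, tile)] for tile in v]     # stage 2: multiply
    out = []
    for tile in m:                                               # stage 3: outputs
        out.extend(transform_output(tile))
    return out

def direct_correlation(signal, g):
    """Reference: sliding dot product, as used in CNN 'convolution' layers."""
    return [sum(signal[i + k] * g[k] for k in range(3))
            for i in range(len(signal) - 2)]
```

In a GPU setting, each of these stages would typically be a separate kernel exchanging data through global memory, which is why the data format between kernels matters for memory access efficiency. The sketch uses 4 multiplications per tile where the direct form uses 6 for the same two outputs.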