amenable to massively parallel computing accelerated with general-purpose graphics
processing units (GPUs). However, the computational performance of such schemes
strongly depends on their implementation. Several implementation strategies have been
proposed: some rely exclusively on specialized compute kernels tuned for each
operation, while others leverage BLAS libraries that provide optimized routines for basic …