efficient utilization of the available computing resources of host CPU cores for CUDA kernels,
which are designed to run only on the GPU. The proposed system exploits coarse-grained
thread-level parallelism across the CPU and GPU at runtime, without any source recompilation.
To this end, this paper describes three features: a work distribution module, a transparent
memory space, and a global scheduling queue. With a completely automatic …