Fast multilevel Lagrangian carotid strain imaging with GPU computing

NH Meshram, T Varghese - 2017 IEEE International Ultrasonics …, 2017 - ieeexplore.ieee.org
2017 IEEE International Ultrasonics Symposium (IUS), 2017
Lagrangian carotid strain imaging (LCSI) estimates the deformation of the carotid artery caused by blood pressure variations under cardiac pulsation. Local strain is tracked over a cardiac cycle, which is computationally intensive. The resulting long offline processing times are a limiting factor for clinical adoption of LCSI. We report on the computational speedup obtained with a parallelized implementation of LCSI using CUDA programming. LCSI is currently performed using a multi-level block matching algorithm written in C++ using the Insight Toolkit (ITK). We have implemented this code on an NVIDIA K40 GPU, with CUDA kernels called from the ITK C++ code. The multi-level algorithm consists of three processing levels: level 1 performs block matching at the coarsest scale, while level 3 performs block matching at the finest scale on radiofrequency signals. The regularization step, which incurred the largest computational time, was implemented on the GPU first. Cross-correlation was then implemented on the GPU together with the regularization step, thereby avoiding a CPU-to-GPU data transfer. Shared memory was used in the regularization step to further reduce processing time. The computation time per frame pair for LCSI with our initial implementation was about 316.41 s for an in vivo human carotid data set, or about 131 minutes for an entire loop over a cardiac cycle with 25 frames. Implementing regularization on the GPU reduced the time to 99.92 s per frame pair, a speedup of 3.16X. Further optimization, with cross-correlation implemented on the GPU and shared memory in use, reduced the computation time to 23 s per frame pair, a speedup of 13.75X, bringing the processing time over a cardiac cycle down to 9.5 minutes.
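The block matching and cross-correlation steps named in the abstract can be illustrated with a minimal CPU sketch in C++. This is not the authors' ITK or CUDA code. It assumes a single 1-D signal line, a single level, an exhaustive integer-shift search, and no regularization. The function names `ncc` and `best_shift` and all of their parameters are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized cross-correlation between two windows of equal length:
// a[a0 .. a0+len) and b[b0 .. b0+len).
// Returns a value in [-1, 1], or 0 if either window has zero variance.
double ncc(const std::vector<double>& a, std::size_t a0,
           const std::vector<double>& b, std::size_t b0,
           std::size_t len) {
    assert(len > 0 && a0 + len <= a.size() && b0 + len <= b.size());

    // Window means.
    double mean_a = 0.0, mean_b = 0.0;
    for (std::size_t i = 0; i < len; ++i) {
        mean_a += a[a0 + i];
        mean_b += b[b0 + i];
    }
    mean_a /= static_cast<double>(len);
    mean_b /= static_cast<double>(len);

    // Zero-mean correlation and the two energies used to normalize it.
    double num = 0.0, var_a = 0.0, var_b = 0.0;
    for (std::size_t i = 0; i < len; ++i) {
        const double da = a[a0 + i] - mean_a;
        const double db = b[b0 + i] - mean_b;
        num += da * db;
        var_a += da * da;
        var_b += db * db;
    }
    const double denom = std::sqrt(var_a * var_b);
    return denom > 0.0 ? num / denom : 0.0;
}

// Hypothetical helper: exhaustive search for the integer shift s in
// [-max_shift, max_shift] that maximizes the NCC between the block
// pre[start .. start+len) and the window post[start+s .. start+s+len).
// Candidate windows that fall outside `post` are skipped.
int best_shift(const std::vector<double>& pre,
               const std::vector<double>& post,
               std::size_t start, std::size_t len, int max_shift) {
    int best = 0;
    double best_score = -2.0;  // below the minimum possible NCC value
    for (int s = -max_shift; s <= max_shift; ++s) {
        const long pos = static_cast<long>(start) + s;
        if (pos < 0 || static_cast<std::size_t>(pos) + len > post.size()) {
            continue;
        }
        const double score =
            ncc(pre, start, post, static_cast<std::size_t>(pos), len);
        if (score > best_score) {
            best_score = score;
            best = s;
        }
    }
    return best;
}
```

In this sketch every candidate shift is scored independently of the others, which is the kind of work that tends to parallelize well on a GPU. How the authors organized their kernels, and how their regularization step uses shared memory, is not specified in the abstract.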