Lagrangian carotid strain imaging (LCSI) estimates deformation of the carotid artery caused by blood pressure variations under cardiac pulsation. Local strain is tracked over a cardiac cycle, which is computationally intensive; the resulting long offline processing times become a limiting factor for clinical adoption. We report the computational speedup obtained with a parallelized implementation of LCSI using CUDA programming for fast computation. LCSI is currently performed with a multi-level block-matching algorithm written in C++ using the Insight Toolkit (ITK). We implemented this code on an NVIDIA K40 GPU, with CUDA kernels called from the ITK C++ code. The multi-level algorithm consists of three levels: level 1 performs block matching at the coarsest scale, while level 3 performs block matching at the finest scale on radiofrequency signals. The regularization step, which incurred the largest computational time, was implemented on the GPU. Cross-correlation was then implemented alongside the regularization step on the GPU, thereby avoiding a CPU-to-GPU data transfer. Shared memory was used in the regularization step to further reduce processing time. With our initial CPU implementation, the computation time per frame pair was about 316.41 seconds for an in vivo human carotid data set, i.e., 131 minutes for an entire loop over a cardiac cycle with 25 frames. GPU implementation of regularization reduced this to 99.92 seconds per frame pair, a speedup of 3.16X. Further optimization, moving cross-correlation to the GPU and using shared memory, improved the computation time to 23 seconds per frame pair, a speedup of 13.75X, reducing processing time over a cardiac cycle to 9.5 minutes.