H Yao, L Wang, C Cai, W Wang, Z Zhang… - Image and Vision …, 2024 - Elsevier
Visual grounding (VG) is the task of locating a specific region in an image
according to a natural language expression. Existing efforts on the VG task are divided into …