Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR

Z Li, Y Guo, K Wang, X Chen, L Nie… - Proceedings of the 31st …, 2023 - dl.acm.org
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question
answering over visual scenes. To achieve this goal, a model is required to provide an …

Learning Feature Semantic Matching for Spatio-Temporal Video Grounding

T Zhang, H Fang, H Zhang, J Gao, X Lu… - IEEE Transactions …, 2024 - ieeexplore.ieee.org
Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube, including
temporal boundaries and object bounding boxes, that semantically corresponds to a given …

Two-Step Discrete Hashing for Cross-Modal Retrieval

J Tu, X Liu, Y Hao, R Hong… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Cross-modal hashing is an effective approach for information retrieval from large and
heterogeneous cross-modal datasets, owing to its low storage cost and high computational …

Unbiased Visual Question Answering by Leveraging Instrumental Variable

Y Pan, J Liu, L Jin, Z Li - IEEE Transactions on Multimedia, 2024 - ieeexplore.ieee.org
Existing unbiased visual question answering (VQA) models reduce the spurious correlation
between questions and answers to force the models to focus on visual information …

How to Use Language Expert to Assist Inference for Visual Commonsense Reasoning

Z Song, W Hu, H Ye, R Hong - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
The Visual Commonsense Reasoning (VCR) task requires a Vision and Language Model (VLM) to
capture cognitive-level clues from the visual-language input and give the right answers to …

Multi-modal Large Language Model Enhanced Pseudo 3D Perception Framework for Visual Commonsense Reasoning

J Zhu, H Wang, M Shi - … on Circuits and Systems for Video …, 2024 - ieeexplore.ieee.org
The visual commonsense reasoning (VCR) task is to choose an answer and provide a
justifying rationale based on the given image and textual question. Representative works …

Multi-task Visual Semantic Embedding Network for image-text retrieval

XY Qin, LS Li, JY Tang, F Hao, ML Ge, GY Pang - Journal of Computer Science … - iccvm.org
Image-text retrieval aims to capture the semantic correspondence between images and
texts, which serves as a foundation and crucial component in multi-modal recommendations …