MuKEA: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering

Y Ding, J Yu, B Liu, Y Hu, M Cui… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Knowledge-based visual question answering requires the ability to associate
external knowledge for open-ended cross-modal scene understanding. One limitation of …

A unified end-to-end retriever-reader framework for knowledge-based VQA

Y Guo, L Nie, Y Wong, Y Liu, Z Cheng… - Proceedings of the 30th …, 2022 - dl.acm.org
Knowledge-based Visual Question Answering (VQA) expects models to rely on external
knowledge for robust answer prediction. Despite its significance, this paper identifies several …

Resolving zero-shot and fact-based visual question answering via enhanced fact retrieval

S Wu, G Zhao, X Qian - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org
Practical applications of visual question answering (VQA) systems are challenging, and
recent research has aimed at investigating this important field. Many issues related to real …

Semantic collaborative learning for cross-modal moment localization

Y Hu, K Wang, M Liu, H Tang, L Nie - ACM Transactions on Information …, 2023 - dl.acm.org
Localizing a desired moment within an untrimmed video via a given natural language query,
i.e., cross-modal moment localization, has attracted widespread research attention recently …

HybridPrompt: bridging language models and human priors in prompt tuning for visual question answering

Z Ma, Z Yu, J Li, G Li - Proceedings of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org
Visual Question Answering (VQA) aims to answer the natural language question
about a given image by understanding multimodal content. However, the answer quality of …

Exploiting the Social-Like Prior in Transformer for Visual Reasoning

Y Han, Y Hu, X Song, H Tang, M Xu… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Benefiting from the instrumental global dependency modeling of self-attention (SA), transformer-
based approaches have become a pivotal choice for numerous downstream visual …

SceneGATE: Scene-graph based co-attention networks for text visual question answering

F Cao, S Luo, F Nunez, Z Wen, J Poon, SC Han - Robotics, 2023 - mdpi.com
Visual Question Answering (VQA) models fail catastrophically on questions related to the
reading of text-carrying images. In contrast, TextVQA aims to answer questions by …

Multimodal Bi-direction guided attention networks for visual question answering

L Cai, N Xu, H Tian, K Chen, H Fan - Neural Processing Letters, 2023 - Springer
Visual question answering (VQA) has recently become a research hotspot in the computer
vision and natural language processing fields. A core challenge in VQA is how to fuse multi …

Bridging the Cross-Modality Semantic Gap in Visual Question Answering

B Wang, Y Ma, X Li, J Gao, Y Hu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The objective of visual question answering (VQA) is to adequately comprehend a question
and identify relevant content in an image that can provide an answer. Existing approaches …

Boosting Visual Question Answering Through Geometric Perception and Region Features

H Yu, Z Wang, Y Liu, H Liu - ECAI 2023, 2023 - ebooks.iospress.nl
Visual question answering (VQA) is a crucial yet challenging task in multimodal
understanding. To correctly answer questions about an image, VQA models are required to …