The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

Positional attention guided transformer-like architecture for visual question answering

A Mao, Z Yang, K Lin, J Xuan… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Transformer architectures have recently been introduced into the field of visual question
answering (VQA), due to their powerful capabilities of information extraction and fusion …

Contrastive region guidance: Improving grounding in vision-language models without training

D Wan, J Cho, E Stengel-Eskin, M Bansal - arXiv preprint arXiv …, 2024 - arxiv.org
Highlighting particularly relevant regions of an image can improve the performance of vision-
language models (VLMs) on various vision-language (VL) tasks by guiding the model to …

Data efficient masked language modeling for vision and language

Y Bitton, G Stanovsky, M Elhadad… - arXiv preprint arXiv …, 2021 - arxiv.org
Masked language modeling (MLM) is one of the key sub-tasks in vision-language
pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and …
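
As a point of reference for the random masking this snippet describes, the sketch below shows a plain BERT-style masking pass over the sentence tokens only. The 15% rate, the MASK_ID value, and the helper name are illustrative assumptions, not details of the paper's data-efficient strategy.

```python
import random

# Assumed values for illustration; they are common BERT-style defaults,
# not taken from the cited paper.
MASK_ID = 103
MASK_PROB = 0.15

def mask_tokens(token_ids, mask_prob=MASK_PROB, seed=None):
    """Randomly replace a fraction of text tokens with [MASK].

    In cross-modal MLM the image features are left untouched; only the
    sentence tokens are masked, and the model must recover them from the
    remaining text plus the visual context.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)
            labels.append(tok)      # supervise only the masked positions
        else:
            masked.append(tok)
            labels.append(-100)     # ignored by a typical cross-entropy loss
    return masked, labels

# Toy example: token ids of a short question
ids, labels = mask_tokens([2054, 3609, 2003, 1996, 4937], seed=0)
```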

Visfis: Visual feature importance supervision with right-for-the-right-reason objectives

Z Ying, P Hase, M Bansal - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Many past works aim to improve visual reasoning in models by supervising feature
importance (estimated by model explanation techniques) with human annotations such as …
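
To make the idea of supervising feature importance concrete, here is a minimal sketch of one possible alignment term between model-estimated importance and a human relevance annotation. The KL form and the function name are assumptions for illustration; they do not reproduce the paper's combined objectives.

```python
import numpy as np

def importance_alignment_loss(model_importance, human_mask, eps=1e-8):
    """Toy 'right-for-the-right-reason' term: penalize disagreement between
    model-estimated feature importance and a human relevance annotation.

    model_importance: (num_regions,) non-negative scores from an explanation
                      method (e.g. attention or gradient-based saliency).
    human_mask:       (num_regions,) binary or soft human annotation.
    """
    p = human_mask / (human_mask.sum() + eps)        # normalize to distributions
    q = model_importance / (model_importance.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))   # KL(human || model)

loss = importance_alignment_loss(np.array([0.1, 0.7, 0.2]), np.array([0.0, 1.0, 0.0]))
```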

Robust visual question answering: Datasets, methods, and future challenges

J Ma, P Wang, D Kong, Z Wang, J Liu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Visual question answering requires a system to provide an accurate natural language
answer given an image and a natural language question. However, it is widely recognized …

Guiding visual question answering with attention priors

TM Le, V Le, S Gupta… - Proceedings of the …, 2023 - openaccess.thecvf.com
The current success of modern visual reasoning systems is arguably attributed to cross-
modality attention mechanisms. However, in deliberative reasoning such as in VQA …
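
For context on the cross-modality attention the snippet refers to, a minimal sketch of generic scaled dot-product attention from question tokens to image regions follows. This is textbook cross-attention, not the attention-prior guidance the paper proposes.

```python
import numpy as np

def cross_modal_attention(question_tokens, image_regions, d_k=64):
    """Scaled dot-product attention with text queries over image keys/values.

    question_tokens: (num_tokens, d_k) text features acting as queries
    image_regions:   (num_regions, d_k) visual features acting as keys and values
    """
    scores = question_tokens @ image_regions.T / np.sqrt(d_k)     # (T, R)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                # softmax over regions
    return weights @ image_regions                                # attended visual features

# Toy example: 5 question tokens attending over 36 detected regions
attended = cross_modal_attention(np.random.randn(5, 64), np.random.randn(36, 64))
```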

Co-attention graph convolutional network for visual question answering

C Liu, YY Tan, TT Xia, J Zhang, M Zhu - Multimedia Systems, 2023 - Springer
Visual Question Answering (VQA) is a challenging task that requires a fine-grained
understanding of both the visual content of images and the textual content of questions …
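
As background for the "graph convolutional" part of the title, the sketch below applies one standard Kipf-Welling propagation step over region features. The co-attention coupling between the question graph and the image graph is not reproduced here, and all shapes are illustrative assumptions.

```python
import numpy as np

def gcn_layer(adj, node_feats, weight):
    """One vanilla graph-convolution step: ReLU(D^-1/2 (A+I) D^-1/2 H W).

    adj:        (N, N) adjacency over image regions (or question words)
    node_feats: (N, d_in) node features
    weight:     (d_in, d_out) learnable projection
    """
    a_hat = adj + np.eye(adj.shape[0])                     # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm_adj = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ node_feats @ weight, 0.0)

# Toy example: 36 region features (2048-d) projected to 512-d
h = gcn_layer(np.ones((36, 36)), np.random.randn(36, 2048), np.random.randn(2048, 512))
```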

Cross-modality multiple relations learning for knowledge-based visual question answering

Y Wang, P Li, Q Si, H Zhang, W Zang, Z Lin… - ACM Transactions on …, 2023 - dl.acm.org
Knowledge-based visual question answering not only needs to answer questions based
on images but also incorporates external knowledge to support reasoning in the joint space of …

Visual question answering: A survey on techniques and common trends in recent literature

ACAM de Faria, FC Bastos, JVNA da Silva… - arXiv preprint arXiv …, 2023 - arxiv.org
Visual Question Answering (VQA) is an emerging area of interest for researchers, being a
recent problem in natural language processing and image prediction. In this area, an …