The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Abstract Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

An analysis of graph convolutional networks and recent datasets for visual question answering

AA Yusuf, F Chong, M Xianling - Artificial Intelligence Review, 2022 - Springer
Graph neural networks are a deep learning approach that has recently been widely applied to structural and non-structural scenarios due to their substantial performance and interpretability. In a non …

MRA-Net: Improving VQA via multi-modal relation attention network

L Peng, Y Yang, Z Wang, Z Huang… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Visual Question Answering (VQA) is the task of answering natural language questions tied to the content of visual images. Most recent VQA approaches usually apply an attention mechanism to …

Dual self-attention with co-attention networks for visual question answering

Y Liu, X Zhang, Q Zhang, C Li, F Huang, X Tang, Z Li - Pattern Recognition, 2021 - Elsevier
Abstract Visual Question Answering (VQA), as an important task in understanding vision and language, has been proposed and has aroused wide interest. In previous VQA methods …

Scene graph refinement network for visual question answering

T Qian, J Chen, S Chen, B Wu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Visual Question Answering aims to answer free-form natural language questions based on the visual clues in a given image. It is a difficult problem as it requires understanding the …

A survey of methods, datasets and evaluation metrics for visual question answering

H Sharma, AS Jalal - Image and Vision Computing, 2021 - Elsevier
Abstract Visual Question Answering (VQA) is a multi-disciplinary research problem that has captured the attention of both computer vision and natural language processing …

Cascade reasoning network for text-based visual question answering

F Liu, G Xu, Q Wu, Q Du, W Jia, M Tan - Proceedings of the 28th ACM …, 2020 - dl.acm.org
We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA), which only builds connections between questions …

DeltaNet: Conditional medical report generation for COVID-19 diagnosis

X Wu, S Yang, Z Qiu, S Ge, Y Yan, X Wu… - arXiv preprint arXiv …, 2022 - arxiv.org
Fast screening and diagnosis are critical in treating COVID-19 patients. In addition to the gold-standard RT-PCR test, radiological imaging such as X-ray and CT also serves as an important …

Audio-visual event localization by learning spatial and semantic co-attention

C Xue, X Zhong, M Cai, H Chen… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
This work aims to temporally localize events that are both audible and visible in video.
Previous methods mainly focused on temporal modeling of events with simple fusion of …

CRA-Net: Composed relation attention network for visual question answering

L Peng, Y Yang, Z Wang, X Wu, Z Huang - Proceedings of the 27th ACM …, 2019 - dl.acm.org
The task of Visual Question Answering (VQA) is to answer a natural language question tied to the content of a visual image. Most existing VQA models either apply an attention mechanism …