The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …
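
Since this entry concerns attention-based fusion of image and question features, here is a minimal, hypothetical sketch of question-guided attention fusion for VQA, assuming pre-extracted region features and a pooled question embedding; the dimensions, answer-vocabulary size, and concatenation-based fusion are illustrative assumptions, not the survey's taxonomy.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, img_dim=2048, q_dim=768, hidden=512, n_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.att = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden + q_dim, n_answers)

    def forward(self, img_regions, q_emb):
        # img_regions: (B, R, img_dim) region features; q_emb: (B, q_dim) question embedding
        img_h = self.img_proj(img_regions)                                 # (B, R, hidden)
        joint = torch.tanh(img_h + self.q_proj(q_emb).unsqueeze(1))        # question-guided scores
        weights = torch.softmax(self.att(joint), dim=1)                    # (B, R, 1) attention over regions
        attended = (weights * img_h).sum(dim=1)                            # (B, hidden) attended image feature
        return self.classifier(torch.cat([attended, q_emb], dim=-1))       # answer logits after concat fusion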

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Multiscale feature extraction and fusion of image and text in VQA

S Lu, Y Ding, M Liu, Z Yin, L Yin, W Zheng - International Journal of …, 2023 - Springer
Visual Question Answering (VQA) is the process of finding useful
information in images related to a question in order to answer that question correctly. It can be …
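
As an illustration of the multiscale extraction-and-fusion idea named in this title, the sketch below pools the same image feature map at several spatial scales and concatenates the results with a question embedding; the scale choices, dimensions, and linear fusion head are assumptions for illustration, not the paper's design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleFusion(nn.Module):
    def __init__(self, img_channels=512, q_dim=768, out_dim=1024):
        super().__init__()
        self.scales = (1, 2, 4)                               # pooled grid sizes (assumed)
        pooled_dim = img_channels * sum(s * s for s in self.scales)
        self.fuse = nn.Linear(pooled_dim + q_dim, out_dim)

    def forward(self, feat_map, q_emb):
        # feat_map: (B, C, H, W) image features; q_emb: (B, q_dim) question embedding
        pooled = [F.adaptive_avg_pool2d(feat_map, s).flatten(1) for s in self.scales]
        return self.fuse(torch.cat(pooled + [q_emb], dim=-1))  # fused multiscale representation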

A review on the attention mechanism of deep learning

Z Niu, G Zhong, H Yu - Neurocomputing, 2021 - Elsevier
Attention has arguably become one of the most important concepts in the deep learning
field. It is inspired by the biological systems of humans, which tend to focus on the distinctive …
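
For context on the basic mechanism such reviews cover, this is the standard scaled dot-product attention written out explicitly; it is a generic formulation, not content specific to this survey.

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: (B, Tq, d) queries; k, v: (B, Tk, d) keys and values
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (B, Tq, Tk) similarity scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block masked positions
    weights = torch.softmax(scores, dim=-1)                    # attention distribution
    return weights @ v, weights                                # context vectors and weights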

A general survey on attention mechanisms in deep learning

G Brauwers, F Frasincar - IEEE Transactions on Knowledge …, 2021 - ieeexplore.ieee.org
Attention is an important mechanism that can be employed for a variety of deep learning
models across many different domains and tasks. This survey provides an overview of the …
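
Surveys of attention mechanisms commonly distinguish different scoring functions; as one example alongside the dot-product form above, here is a sketch of additive (Bahdanau-style) attention, where scores come from a small feed-forward network. Dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, q_dim, k_dim, hidden=128):
        super().__init__()
        self.wq, self.wk = nn.Linear(q_dim, hidden), nn.Linear(k_dim, hidden)
        self.v = nn.Linear(hidden, 1)

    def forward(self, query, keys, values):
        # query: (B, q_dim); keys, values: (B, T, k_dim)
        scores = self.v(torch.tanh(self.wk(keys) + self.wq(query).unsqueeze(1)))  # (B, T, 1)
        weights = torch.softmax(scores, dim=1)                                    # attention over T positions
        return (weights * values).sum(dim=1), weights.squeeze(-1)                 # context vector and weights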

A review of uncertainty quantification in deep learning: Techniques, applications and challenges

M Abdar, F Pourpanah, S Hussain, D Rezazadegan… - Information fusion, 2021 - Elsevier
Uncertainty quantification (UQ) methods play a pivotal role in reducing the impact of
uncertainties during both optimization and decision making processes. They have been …
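
As one widely used UQ technique of the kind such reviews cover, the sketch below applies Monte Carlo dropout: dropout is kept stochastic at test time and several forward passes are averaged. This is a generic illustration, not a method proposed in the cited review.

import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    # Keep dropout layers stochastic at inference; note this also puts batch-norm layers
    # in training mode, which may not be desired for models that use them.
    model.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])  # (T, B, C) sampled predictions
    return preds.mean(dim=0), preds.var(dim=0)                     # predictive mean and variance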

Counterfactual attention learning for fine-grained visual categorization and re-identification

Y Rao, G Chen, J Lu, J Zhou - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
Attention mechanism has demonstrated great potential in fine-grained visual recognition
tasks. In this paper, we present a counterfactual attention learning method to learn more …
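
One common reading of the counterfactual-attention idea is to compare the prediction obtained with the learned attention map against one obtained with a random (counterfactual) attention map, and to train on the resulting gap. The sketch below is a hypothetical illustration of that idea; the function names, pooling, and loss terms are assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def counterfactual_attention_loss(features, att_logits, classifier, labels):
    # features: (B, C, H, W) feature map; att_logits: (B, 1, H, W) unnormalized attention scores
    B, _, H, W = att_logits.shape
    att = torch.softmax(att_logits.view(B, 1, -1), dim=-1).view(B, 1, H, W)
    cf_att = torch.softmax(torch.rand_like(att_logits).view(B, 1, -1), dim=-1).view(B, 1, H, W)
    pooled = (features * att).sum(dim=(2, 3))          # factual attention-weighted pooling
    pooled_cf = (features * cf_att).sum(dim=(2, 3))    # counterfactual (random) attention pooling
    logits, logits_cf = classifier(pooled), classifier(pooled_cf)
    # supervise both the factual prediction and the "effect" of attention (factual minus counterfactual)
    return F.cross_entropy(logits, labels) + F.cross_entropy(logits - logits_cf, labels)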

Multimodal co-attention transformer for survival prediction in gigapixel whole slide images

RJ Chen, MY Lu, WH Weng, TY Chen… - Proceedings of the …, 2021 - openaccess.thecvf.com
Survival outcome prediction is a challenging weakly-supervised and ordinal regression task
in computational pathology that involves modeling complex interactions within the tumor …
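
As a generic illustration of the co-attention building block used in multimodal transformers, the sketch below lets two modality token sequences attend to each other with standard multi-head cross-attention; it shows the mechanism only, not the paper's specific survival-prediction architecture.

import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens_a, tokens_b):
        # Each modality queries the other and is enriched with cross-modal context.
        a_ctx, _ = self.a_to_b(query=tokens_a, key=tokens_b, value=tokens_b)
        b_ctx, _ = self.b_to_a(query=tokens_b, key=tokens_a, value=tokens_a)
        return tokens_a + a_ctx, tokens_b + b_ctx   # residual-updated token sequences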

Seeing out of the box: End-to-end pre-training for vision-language representation learning

Z Huang, Z Zeng, Y Huang, B Liu… - Proceedings of the …, 2021 - openaccess.thecvf.com
We study the joint learning of a Convolutional Neural Network (CNN) and a Transformer for
vision-language pre-training (VLPT), which aims to learn cross-modal alignments from …
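
A rough sketch of the joint CNN + Transformer setup this entry describes: a CNN backbone produces grid features that are flattened into visual tokens and encoded together with text token embeddings by a shared Transformer. The choice of ResNet-50, the dimensions, and the plain concatenation are illustrative assumptions, not the paper's exact model.

import torch
import torch.nn as nn
import torchvision

class CNNTransformerVLEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, layers=6):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)        # no pretrained weights in this sketch
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])   # keep the spatial feature grid
        self.visual_proj = nn.Linear(2048, dim)
        self.text_emb = nn.Embedding(vocab_size, dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, images, token_ids):
        grid = self.cnn(images)                                    # (B, 2048, H, W) grid features
        vis = self.visual_proj(grid.flatten(2).transpose(1, 2))    # (B, H*W, dim) visual tokens
        txt = self.text_emb(token_ids)                             # (B, L, dim) text tokens
        return self.encoder(torch.cat([vis, txt], dim=1))          # joint cross-modal encoding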

Multi-granularity cross-modal alignment for generalized medical visual representation learning

F Wang, Y Zhou, S Wang… - Advances in Neural …, 2022 - proceedings.neurips.cc
Learning medical visual representations directly from paired radiology reports has become
an emerging topic in representation learning. However, existing medical image-text joint …
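
As a minimal sketch of one granularity of cross-modal alignment, the function below implements the standard symmetric contrastive (CLIP-style) objective over pooled image and report embeddings; it is not the paper's full multi-granularity loss, and the temperature value is an assumption.

import torch
import torch.nn.functional as F

def global_alignment_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (B, D) pooled embeddings of paired images and reports
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device) # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))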