The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …
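
As a concrete illustration of the attention-based fusion this review surveys, below is a minimal sketch of question-guided attention over image region features. The shapes, the 36-region detector convention, and the concatenation fusion are illustrative assumptions, not the review's prescription.

```python
# Minimal sketch: question features attend over image region features via
# scaled dot-product attention, then the modalities are fused by concatenation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(question_vec, region_feats):
    """Weight image regions by their relevance to the question.

    question_vec: (d,) pooled question embedding
    region_feats: (k, d) features for k detected image regions
    returns: (d,) attended image representation
    """
    d = question_vec.shape[0]
    scores = region_feats @ question_vec / np.sqrt(d)  # (k,) relevance scores
    weights = softmax(scores)                          # attention over regions
    return weights @ region_feats                      # weighted sum, (d,)

rng = np.random.default_rng(0)
q = rng.normal(size=512)        # e.g. a pooled question vector (assumed dim)
v = rng.normal(size=(36, 512))  # e.g. 36 region features (assumed shape)
fused = np.concatenate([q, attend(q, v)])  # simple fusion by concatenation
print(fused.shape)  # (1024,)
```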

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

A review of research on convolutional neural networks

Y Li, Z Hao, H Lei - 计算机应用 (Journal of Computer Applications), 2016 - joca.cn
In recent years, convolutional neural networks have achieved a series of breakthrough results in
image classification, object detection, image semantic segmentation, and other fields; their powerful feature learning and classification capabilities have drawn broad attention and hold important analytical and research value …

Multiscale feature extraction and fusion of image and text in VQA

S Lu, Y Ding, M Liu, Z Yin, L Yin, W Zheng - International Journal of …, 2023 - Springer
The Visual Question Answering (VQA) system is the process of finding useful
information in images related to the question in order to answer it correctly. It can be …
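
The multiscale idea in the title can be made concrete with a hedged, generic sketch: pool a convolutional feature map at several grid sizes, flatten, and fuse with a question embedding. This is a stand-in for the general technique, not the paper's actual architecture; the shapes and the concatenation fusion are assumptions.

```python
# Generic multiscale pooling sketch: average-pool an (H, W, C) feature map
# onto progressively finer grids, then concatenate all scales with the text.
import numpy as np

def pool_grid(feat_map, g):
    """Average-pool an (H, W, C) feature map onto a g x g grid, then flatten."""
    H, W, C = feat_map.shape
    out = np.zeros((g, g, C))
    for i in range(g):
        for j in range(g):
            patch = feat_map[i*H//g:(i+1)*H//g, j*W//g:(j+1)*W//g]
            out[i, j] = patch.mean(axis=(0, 1))
    return out.reshape(-1)  # (g*g*C,)

rng = np.random.default_rng(0)
fmap = rng.normal(size=(14, 14, 256))  # CNN feature map (assumed shape)
q = rng.normal(size=512)               # question embedding (assumed dim)
multiscale = np.concatenate([pool_grid(fmap, g) for g in (1, 2, 4)])
fused = np.concatenate([multiscale, q])  # late fusion by concatenation
print(multiscale.shape, fused.shape)
```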

Knowledge base graph embedding module design for Visual question answering model

W Zheng, L Yin, X Chen, Z Ma, S Liu, B Yang - Pattern Recognition, 2021 - Elsevier
In this paper, a knowledge base graph embedding module is constructed to extend the
versatility of knowledge-based VQA (Visual Question Answering) models. The knowledge …
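
As a stand-in for what embedding a knowledge-base graph can look like, here is a minimal TransE-style scoring sketch. The paper's module is its own design, so the entities, dimensions, and the TransE formulation below are all illustrative assumptions.

```python
# TransE-style sketch: embed entities and relations as vectors so that a
# plausible triple (h, r, t) satisfies h + r ≈ t in the embedding space.
import numpy as np

rng = np.random.default_rng(0)
entities = {"cat": 0, "mammal": 1, "dog": 2}
relations = {"is_a": 0}
E = rng.normal(scale=0.1, size=(len(entities), 64))   # entity embeddings
R = rng.normal(scale=0.1, size=(len(relations), 64))  # relation embeddings

def score(h, r, t):
    """TransE plausibility: smaller ||h + r - t|| means more plausible."""
    return -np.linalg.norm(E[entities[h]] + R[relations[r]] - E[entities[t]])

# Training would push score("cat", "is_a", "mammal") above corrupted triples
# like ("cat", "is_a", "dog") via a margin ranking loss.
print(score("cat", "is_a", "mammal"), score("cat", "is_a", "dog"))
```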

HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips

A Miech, D Zhukov, JB Alayrac… - Proceedings of the …, 2019 - openaccess.thecvf.com
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time-consuming to create and …
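
A hedged sketch of the kind of objective used to learn a joint text-video embedding: project both modalities into a shared space and apply a max-margin ranking loss so matched clip-narration pairs outscore mismatched ones within a batch. The dimensions, margin value, and random inputs are assumptions for illustration.

```python
# Joint embedding with a max-margin ranking loss over in-batch negatives.
import numpy as np

rng = np.random.default_rng(0)
B, dv, dt, d = 4, 1024, 300, 256
W_v = rng.normal(scale=0.01, size=(dv, d))  # video projection (assumed dims)
W_t = rng.normal(scale=0.01, size=(dt, d))  # text projection (assumed dims)

video = rng.normal(size=(B, dv)) @ W_v      # (B, d) clip embeddings
text = rng.normal(size=(B, dt)) @ W_t       # (B, d) narration embeddings
sim = video @ text.T                        # pairwise similarity matrix

margin = 0.2
pos = np.diag(sim)                          # matching pairs on the diagonal
# Hinge terms for every mismatched (video, text) pair in the batch,
# in both retrieval directions (video-to-text and text-to-video).
loss = (np.maximum(0, margin + sim - pos[:, None])
        + np.maximum(0, margin + sim - pos[None, :]))
np.fill_diagonal(loss, 0)                   # don't penalize true pairs
print(loss.sum() / (B * (B - 1)))
```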

OK-VQA: A visual question answering benchmark requiring external knowledge

K Marino, M Rastegari, A Farhadi… - Proceedings of the …, 2019 - openaccess.thecvf.com
Visual Question Answering (VQA) in its ideal form lets us study reasoning in the
joint space of vision and language and serves as a proxy for the AI task of scene …

KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA

K Marino, X Chen, D Parikh, A Gupta… - Proceedings of the …, 2021 - openaccess.thecvf.com
One of the most challenging question types in VQA arises when answering the question requires
outside knowledge not present in the image. In this work we study open-domain knowledge …

Cascade R-CNN: Delving into high quality object detection

Z Cai, N Vasconcelos - … of the IEEE conference on computer …, 2018 - openaccess.thecvf.com
In object detection, an intersection over union (IoU) threshold is required to define positives
and negatives. An object detector trained with a low IoU threshold, e.g., 0.5, usually produces …
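
The IoU criterion the snippet refers to is easy to make concrete: the worked example below computes intersection-over-union for two axis-aligned boxes and labels a proposal against the 0.5 threshold the snippet cites. The specific box coordinates are illustrative.

```python
# Worked IoU example: overlap area divided by union area, then thresholded
# to decide whether a proposal counts as a positive training example.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt = (10, 10, 50, 50)        # ground-truth box
proposal = (20, 20, 60, 60)  # detector proposal
v = iou(gt, proposal)        # 900 / 2300 ≈ 0.39
print(v, "positive" if v >= 0.5 else "negative")  # fails the 0.5 threshold
```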

Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …