The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …
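
As a concrete illustration of the attention-based fusion this review surveys, below is a minimal sketch of question-guided attention over image region features. The shapes, the 36-region detector convention, and the concatenation fusion are illustrative assumptions, not the review's prescription.

```python
# Minimal sketch: question features attend over image region features via
# scaled dot-product attention, then the modalities are fused by concatenation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(question_vec, region_feats):
    """Weight image regions by their relevance to the question.

    question_vec: (d,) pooled question embedding
    region_feats: (k, d) features for k detected image regions
    returns: (d,) attended image representation
    """
    d = question_vec.shape[0]
    scores = region_feats @ question_vec / np.sqrt(d)  # (k,) relevance scores
    weights = softmax(scores)                          # attention over regions
    return weights @ region_feats                      # weighted sum, (d,)

rng = np.random.default_rng(0)
q = rng.normal(size=512)        # e.g. a pooled question vector (assumed dim)
v = rng.normal(size=(36, 512))  # e.g. 36 region features (assumed shape)
fused = np.concatenate([q, attend(q, v)])  # simple fusion by concatenation
print(fused.shape)  # (1024,)
```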

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

A review of research on convolutional neural networks

Y Li, Z Hao, H Lei - 计算机应用 (Journal of Computer Applications), 2016 - joca.cn
In recent years, convolutional neural networks have achieved a series of breakthrough results in
image classification, object detection, image semantic segmentation, and other fields; their powerful feature learning and classification capabilities have drawn broad attention and hold important analytical and research value …

Multiscale feature extraction and fusion of image and text in VQA

S Lu, Y Ding, M Liu, Z Yin, L Yin, W Zheng - International Journal of …, 2023 - Springer
The Visual Question Answering (VQA) system is the process of finding useful
information in images related to the question in order to answer it correctly. It can be …
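
The multiscale idea in the title can be made concrete with a hedged, generic sketch: pool a convolutional feature map at several grid sizes, flatten, and fuse with a question embedding. This is a stand-in for the general technique, not the paper's actual architecture; the shapes and the concatenation fusion are assumptions.

```python
# Generic multiscale pooling sketch: average-pool an (H, W, C) feature map
# onto progressively finer grids, then concatenate all scales with the text.
import numpy as np

def pool_grid(feat_map, g):
    """Average-pool an (H, W, C) feature map onto a g x g grid, then flatten."""
    H, W, C = feat_map.shape
    out = np.zeros((g, g, C))
    for i in range(g):
        for j in range(g):
            patch = feat_map[i*H//g:(i+1)*H//g, j*W//g:(j+1)*W//g]
            out[i, j] = patch.mean(axis=(0, 1))
    return out.reshape(-1)  # (g*g*C,)

rng = np.random.default_rng(0)
fmap = rng.normal(size=(14, 14, 256))  # CNN feature map (assumed shape)
q = rng.normal(size=512)               # question embedding (assumed dim)
multiscale = np.concatenate([pool_grid(fmap, g) for g in (1, 2, 4)])
fused = np.concatenate([multiscale, q])  # late fusion by concatenation
print(multiscale.shape, fused.shape)
```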

Knowledge base graph embedding module design for Visual question answering model

W Zheng, L Yin, X Chen, Z Ma, S Liu, B Yang - Pattern Recognition, 2021 - Elsevier
In this paper, a knowledge base graph embedding module is constructed to extend the
versatility of knowledge-based VQA (Visual Question Answering) models. The knowledge …
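
As a stand-in for what embedding a knowledge-base graph can look like, here is a minimal TransE-style scoring sketch. The paper's module is its own design, so the entities, dimensions, and the TransE formulation below are all illustrative assumptions.

```python
# TransE-style sketch: embed entities and relations as vectors so that a
# plausible triple (h, r, t) satisfies h + r ≈ t in the embedding space.
import numpy as np

rng = np.random.default_rng(0)
entities = {"cat": 0, "mammal": 1, "dog": 2}
relations = {"is_a": 0}
E = rng.normal(scale=0.1, size=(len(entities), 64))   # entity embeddings
R = rng.normal(scale=0.1, size=(len(relations), 64))  # relation embeddings

def score(h, r, t):
    """TransE plausibility: smaller ||h + r - t|| means more plausible."""
    return -np.linalg.norm(E[entities[h]] + R[relations[r]] - E[entities[t]])

# Training would push score("cat", "is_a", "mammal") above corrupted triples
# like ("cat", "is_a", "dog") via a margin ranking loss.
print(score("cat", "is_a", "mammal"), score("cat", "is_a", "dog"))
```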

HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips

A Miech, D Zhukov, JB Alayrac… - Proceedings of the …, 2019 - openaccess.thecvf.com
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time-consuming to create and …
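
A hedged sketch of the kind of objective used to learn a joint text-video embedding: project both modalities into a shared space and apply a max-margin ranking loss so matched clip-narration pairs outscore mismatched ones within a batch. The dimensions, margin value, and random inputs are assumptions for illustration.

```python
# Joint embedding with a max-margin ranking loss over in-batch negatives.
import numpy as np

rng = np.random.default_rng(0)
B, dv, dt, d = 4, 1024, 300, 256
W_v = rng.normal(scale=0.01, size=(dv, d))  # video projection (assumed dims)
W_t = rng.normal(scale=0.01, size=(dt, d))  # text projection (assumed dims)

video = rng.normal(size=(B, dv)) @ W_v      # (B, d) clip embeddings
text = rng.normal(size=(B, dt)) @ W_t       # (B, d) narration embeddings
sim = video @ text.T                        # pairwise similarity matrix

margin = 0.2
pos = np.diag(sim)                          # matching pairs on the diagonal
# Hinge terms for every mismatched (video, text) pair in the batch,
# in both retrieval directions (video-to-text and text-to-video).
loss = (np.maximum(0, margin + sim - pos[:, None])
        + np.maximum(0, margin + sim - pos[None, :]))
np.fill_diagonal(loss, 0)                   # don't penalize true pairs
print(loss.sum() / (B * (B - 1)))
```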

OK-VQA: A visual question answering benchmark requiring external knowledge

K Marino, M Rastegari, A Farhadi… - Proceedings of the …, 2019 - openaccess.thecvf.com
Visual Question Answering (VQA) in its ideal form lets us study reasoning in the
joint space of vision and language and serves as a proxy for the AI task of scene …

KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA

K Marino, X Chen, D Parikh, A Gupta… - Proceedings of the …, 2021 - openaccess.thecvf.com
One of the most challenging question types in VQA arises when answering the question requires
outside knowledge not present in the image. In this work we study open-domain knowledge …

Cascade R-CNN: Delving into high quality object detection

Z Cai, N Vasconcelos - … of the IEEE conference on computer …, 2018 - openaccess.thecvf.com
In object detection, an intersection over union (IoU) threshold is required to define positives
and negatives. An object detector trained with a low IoU threshold, e.g., 0.5, usually produces …
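
The IoU criterion the snippet refers to is easy to make concrete: the worked example below computes intersection-over-union for two axis-aligned boxes and labels a proposal against the 0.5 threshold the snippet cites. The specific box coordinates are illustrative.

```python
# Worked IoU example: overlap area divided by union area, then thresholded
# to decide whether a proposal counts as a positive training example.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt = (10, 10, 50, 50)        # ground-truth box
proposal = (20, 20, 60, 60)  # detector proposal
v = iou(gt, proposal)        # 900 / 2300 ≈ 0.39
print(v, "positive" if v >= 0.5 else "negative")  # fails the 0.5 threshold
```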

Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …