Multimodal residual learning for visual QA

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com

This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Cited by 188 · Related articles · All 7 versions

Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org

Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …

Cited by 415 · Related articles · All 3 versions

Multiscale feature extraction and fusion of image and text in VQA

S Lu, Y Ding, M Liu, Z Yin, L Yin, W Zheng - International Journal of …, 2023 - Springer

The Visual Question Answering (VQA) system is the process of finding useful
information from images related to the question to answer the question correctly. It can be …

Cited by 192 · Related articles · All 5 versions

Deep modular co-attention networks for visual question answering

Z Yu, J Yu, Y Cui, D Tao, Q Tian - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com

Visual Question Answering (VQA) requires a fine-grained and simultaneous
understanding of both the visual content of images and the textual content of questions …

Cited by 1022 · Related articles · All 11 versions

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer

In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …

Cited by 216 · Related articles · All 8 versions

Image retrieval on real-life images with pre-trained vision-and-language models

Z Liu, C Rodriguez-Opazo… - Proceedings of the …, 2021 - openaccess.thecvf.com

We extend the task of composed image retrieval, where an input query consists of an image
and short textual description of how to modify the image. Existing methods have only been …

Cited by 186 · Related articles · All 7 versions

Bilinear attention networks

JH Kim, J Jun, BT Zhang - Advances in neural information …, 2018 - proceedings.neurips.cc

Attention networks in multimodal learning provide an efficient way to utilize given visual
information selectively. However, the computational cost to learn attention distributions for …

Cited by 1064 · Related articles · All 8 versions

Deep multimodal learning: A survey on recent advances and trends

D Ramachandram, GW Taylor - IEEE signal processing …, 2017 - ieeexplore.ieee.org

The success of deep learning has been a catalyst to solving increasingly complex machine-
learning problems, which often involve multiple data modalities. We review recent advances …

Cited by 1007 · Related articles · All 3 versions

Residual attention network for image classification

F Wang, M Jiang, C Qian, S Yang… - Proceedings of the …, 2017 - openaccess.thecvf.com

In this work, we propose "Residual Attention Network", a convolutional neural network using
attention mechanism which can incorporate with state-of-art feed forward network …

Cited by 4514 · Related articles · All 10 versions

FashionVLP: Vision language transformer for fashion retrieval with feedback

S Goenka, Z Zheng, A Jaiswal… - Proceedings of the …, 2022 - openaccess.thecvf.com

Fashion image retrieval based on a query pair of reference image and natural language
feedback is a challenging task that requires models to assess fashion related information …

Cited by 94 · Related articles · All 5 versions
