The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

J Li, D Li, C Xiong, S Hoi - International conference on …, 2022 - proceedings.mlr.press
Vision-Language Pre-training (VLP) has advanced the performance for many vision-
language tasks. However, most existing pre-trained models only excel in either …

Prompt programming for large language models: Beyond the few-shot paradigm

L Reynolds, K McDonell - Extended abstracts of the 2021 CHI …, 2021 - dl.acm.org
Prevailing methods for mapping large generative language models to supervised tasks may
fail to sufficiently probe models' novel capabilities. Using GPT-3 as a case study, we show …

Recent advances in deep learning based dialogue systems: A systematic survey

J Ni, T Young, V Pandelea, F Xue… - Artificial intelligence review, 2023 - Springer
Dialogue systems are a popular natural language processing (NLP) task, as they are promising in
real-life applications. It is also a complicated task since many NLP tasks deserving study are …

Large-scale adversarial training for vision-and-language representation learning

Z Gan, YC Chen, L Li, C Zhu… - Advances in Neural …, 2020 - proceedings.neurips.cc
We present VILLA, the first known effort on large-scale adversarial training for vision-and-
language (V+L) representation learning. VILLA consists of two training stages: (i) task …

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer
In humans, attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …

Two causal principles for improving visual dialog

J Qi, Y Niu, J Huang, H Zhang - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
This paper unravels the design tricks adopted by us, the champion team MReaL-BDAI, for
Visual Dialog Challenge 2019: two causal principles for improving Visual Dialog (VisDial) …

UTC: A unified transformer with inter-task contrastive learning for visual dialog

C Chen, Z Tan, Q Cheng, X Jiang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual Dialog aims to answer multi-round, interactive questions based on the dialog history
and image content. Existing methods either consider answer ranking and generating …

Large-scale pretraining for visual dialog: A simple state-of-the-art baseline

V Murahari, D Batra, D Parikh, A Das - European Conference on Computer …, 2020 - Springer
Prior work in visual dialog has focused on training deep neural models on VisDial in
isolation. Instead, we present an approach to leverage pretraining on related vision …