The multi-modal fusion in visual question answering: a review of attention mechanisms

S Lu, M Liu, L Yin, Z Yin, X Liu, W Zheng - PeerJ Computer Science, 2023 - peerj.com
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …

Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer
In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …

The garden of forking paths: Towards multi-future trajectory prediction

J Liang, L Jiang, K Murphy, T Yu… - Proceedings of the …, 2020 - openaccess.thecvf.com
This paper studies the problem of predicting the distribution over multiple possible future
paths of people as they move through various visual scenes. We make two main …

Entity-enhanced adaptive reconstruction network for weakly supervised referring expression grounding

X Liu, L Li, S Wang, ZJ Zha, Z Li… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Weakly supervised Referring Expression Grounding (REG) aims to ground a particular
target in an image described by a language expression while lacking the correspondence …

Digital twin-driven focal modulation-based convolutional network for intelligent fault diagnosis

S Li, Q Jiang, Y Xu, K Feng, Y Wang, B Sun… - Reliability Engineering & …, 2023 - Elsevier
Rolling bearings are essential components of various rotating machinery and are critical in
ensuring safe and reliable industrial production. Deep learning techniques have …

SimAug: Learning robust representations from simulation for trajectory prediction

J Liang, L Jiang, A Hauptmann - … Conference, Glasgow, UK, August 23–28 …, 2020 - Springer
This paper studies the problem of predicting future trajectories of people in unseen cameras
of novel scenarios and views. We approach this problem through the real-data-free setting in …

Reasoning on the relation: Enhancing visual representation for visual question answering and cross-modal retrieval

J Yu, W Zhang, Y Lu, Z Qin, Y Hu… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Cross-modal analysis has become a promising direction for artificial intelligence. Visual
representation is crucial for various cross-modal analysis tasks that require visual content …

Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Temporal sentence grounding in videos (TSGV), aka, natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

Cross-domain image captioning via cross-modal retrieval and model adaptation

W Zhao, X Wu, J Luo - IEEE Transactions on Image Processing, 2020 - ieeexplore.ieee.org
In recent years, large scale datasets of paired images and sentences have enabled the
remarkable success in automatically generating descriptions for images, namely image …