New ideas and trends in deep multimodal content understanding: A review

W Chen, W Wang, L Liu, MS Lew - Neurocomputing, 2021 - Elsevier
The focus of this survey is on the analysis of two modalities of multimodal deep learning:
image and text. Unlike classic reviews of deep learning where monomodal image classifiers …

An improved attention for visual question answering

T Rahman, SH Chou, L Sigal… - Proceedings of the …, 2021 - openaccess.thecvf.com
We consider the problem of Visual Question Answering (VQA). Given an image and a free-
form, open-ended question expressed in natural language, the goal of a VQA system is to …

A survey of methods, datasets and evaluation metrics for visual question answering

H Sharma, AS Jalal - Image and Vision Computing, 2021 - Elsevier
Abstract Visual Question Answering (VQA) is a multi-disciplinary research problem that has
captured the attention of both computer vision and natural language processing …

Visual question answering using deep learning: A survey and performance analysis

Y Srivastava, V Murali, SR Dubey… - Computer Vision and …, 2021 - Springer
Abstract The Visual Question Answering (VQA) task combines the challenges of processing
both visual and linguistic data to answer basic 'common sense' questions …

DialogueTRM: Exploring the intra- and inter-modal emotional behaviors in the conversation

Y Mao, Q Sun, G Liu, X Wang, W Gao, X Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Emotion Recognition in Conversations (ERC) is essential for building empathetic human-
machine systems. Existing studies on ERC primarily focus on summarizing the context …

A systematic evaluation of GPT-4V's multimodal capability for chest X-ray image analysis

Y Liu, Y Li, Z Wang, X Liang, L Liu, L Wang, L Cui, Z Tu… - Meta-Radiology, 2024 - Elsevier
This work evaluates GPT-4V's multimodal capability for medical image analysis, focusing on
three representative tasks: radiology report generation, medical visual question answering …

Multi-concept representation learning for knowledge graph completion

J Wang, B Wang, J Gao, Y Hu, B Yin - ACM Transactions on Knowledge …, 2023 - dl.acm.org
Knowledge Graph Completion (KGC) aims at inferring missing entities or relations by
embedding them in a low-dimensional space. However, most existing KGC methods …

Q2ATransformer: Improving medical VQA via an answer querying decoder

Y Liu, Z Wang, D Xu, L Zhou - International Conference on Information …, 2023 - Springer
Abstract Medical Visual Question Answering (VQA) systems play a supporting role in
understanding clinic-relevant information carried by medical images. The questions to a …

Bilateral cross-modality graph matching attention for feature fusion in visual question answering

J Cao, X Qin, S Zhao, J Shen - IEEE Transactions on Neural …, 2022 - ieeexplore.ieee.org
Answering semantically complicated questions about an image is challenging in a
visual question answering (VQA) task. Although the image can be well represented by deep …

Global-local cross-view fisher discrimination for view-invariant action recognition

L Gao, Y Ji, Y Yang, HT Shen - … of the 30th ACM International Conference …, 2022 - dl.acm.org
View change brings a significant challenge to action representation and recognition due to
pose occlusion and deformation. We propose a Global-Local Cross-View Fisher …