From image to language: A critical analysis of Visual Question Answering (VQA) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey

Q Lin, Y Zhu, X Mei, L Huang, J Ma, K He… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of artificial intelligence has constantly reshaped the field of
intelligent healthcare and medicine. As a vital technology, multimodal learning has …

Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering

J Ma, M Hu, P Wang, W Sun, L Song, H Pei… - arXiv preprint arXiv …, 2024 - arxiv.org
Audio-Visual Question Answering (AVQA) is a complex multimodal reasoning task,
requiring intelligent systems to respond accurately to natural language queries based on …