Visual question answering: A survey on techniques and common trends in recent literature

P Zhou, L Wang, Z Liu, Y Hao, P Hui, S Tarkoma… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper offers an insightful examination of how currently top-trending AI technologies, i.e.,
generative artificial intelligence (Generative AI) and large language models (LLMs), are …

Cited by 32 · Related articles · All 8 versions

VQA and visual reasoning: An overview of approaches, datasets, and future direction

RY Zakari, JW Owusu, K Qin, H Wang, ZK Lawal, T He - Neurocomputing, 2025 - Elsevier

Visual question answering (VQA) is a problem that researchers in both computer vision and
natural language processing are interested in studying. In VQA, a system is given an image …

Cited by 1 · Related articles

[PDF] mdpi.com

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

H Ma, B Fan, BK Ng, CT Lam - Mathematics, 2024 - mdpi.com

Multimodal learning is a promising area in artificial intelligence (AI) that can make the model
understand different kinds of data. Existing works are trying to re-train a new model based …

Cited by 3 · Related articles · All 5 versions

[PDF] arxiv.org

Sportify: Question Answering with Embedded Visualizations and Personified Narratives for Sports Video

C Lee, T Lin, H Pfister… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

As basketball's popularity surges, fans often find themselves confused and overwhelmed by
the rapid game pace and complexity. Basketball tactics, involving a complex series of …

Superpixel semantics representation and pre-training for vision–language tasks

S Zhang, Y Chen, Y Sun, F Wang, J Yang, L Bai, S Gao - Neurocomputing, 2025 - Elsevier

The key to integrating visual language tasks is to establish a good alignment strategy.
Recently, visual semantic representation has achieved fine-grained visual understanding by …

[PDF] arxiv.org

Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation

S Zhang, Y Chen, S Cheng, Y Sun, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org

Within the multimodal field, the key to integrating vision and language lies in establishing a
good alignment strategy. Recently, benefiting from the success of self-supervised learning …

Improving Learning from Visual Demonstration Methods by Target Localization

P Foggia, F Rosa, M Vento - 2024 33rd IEEE International …, 2024 - ieeexplore.ieee.org

This paper presents a novel approach to multi-task visual-guided imitation learning. Upon
evaluating the current state-of-the-art method, we observed its capability to replicate the …

[PDF] arxiv.org

SparrowVQE: Visual Question Explanation for Course Content Understanding

J Li, MK Thota, R Gokhman, R Holik… - 2024 IEEE International …, 2024 - ieeexplore.ieee.org

Visual Question Answering (VQA) research seeks to create AI systems to answer natural
language questions in images, yet VQA methods often yield overly simplistic and short …
