Visual question answering with dense inter- and intra-modality interactions

AM Farahani, P Adibi, MS Ehsani, HP Hutter… - IEEE …, 2023 - ieeexplore.ieee.org

Automated chart analysis has vast potential to improve the accessibility of charts for a wider
audience, e.g., people with visual impairments or other disabilities, by generating captions for …

Cited by 26 - Related articles - All 4 versions

Hair: Hierarchical visual-semantic relational reasoning for video question answering

F Liu, J Liu, W Wang, H Lu - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com

Relational reasoning is at the heart of video question answering. However, existing
approaches suffer from several common limitations: (1) they only focus on either object-level …

Cited by 61 - Related articles - All 3 versions

Test-time model adaptation for visual question answering with debiased self-supervisions

Z Wen, S Niu, G Li, Q Wu, M Tan… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Visual question answering (VQA) is a prevalent task in real-world, and plays an essential
role in helping the blind understand the physical world. However, due to the real-world …

Cited by 17 - Related articles - All 2 versions

Vlab: Enhancing video language pre-training by feature adapting and blending

X He, S Chen, F Ma, Z Huang, X Jin… - IEEE Transactions …, 2024 - ieeexplore.ieee.org

Large-scale image-text contrastive pre-training models, such as CLIP, have been
demonstrated to effectively learn high-quality multimodal representations. However, there is …

Cited by 21 - Related articles - All 3 versions

Encoder–decoder cycle for visual question answering based on perception-action cycle

SAM Mohamud, A Jalali, M Lee - Pattern Recognition, 2023 - Elsevier

In this study, we propose a novel encoder–decoder cycle (EDC) framework inspired by the
human learning process called the perception-action cycle to tackle challenging problems …

Cited by 14 - Related articles - All 3 versions

Causal inference with knowledge distilling and curriculum learning for unbiased VQA

Y Pan, Z Li, L Zhang, J Tang - ACM Transactions on Multimedia …, 2022 - dl.acm.org

Recently, many Visual Question Answering (VQA) models rely on the correlations between
questions and answers yet neglect those between the visual information and the textual …

Cited by 34 - Related articles

Hgan: Hierarchical graph alignment network for image-text retrieval

J Guo, M Wang, Y Zhou, B Song, Y Chi… - IEEE Transactions …, 2023 - ieeexplore.ieee.org

Image-text retrieval (ITR) is a challenging task in the field of multimodal information
processing due to the semantic gap between different modalities. In recent years …

Cited by 19 - Related articles - All 4 versions

Resolving zero-shot and fact-based visual question answering via enhanced fact retrieval

S Wu, G Zhao, X Qian - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org

Practical applications with visual question answering (VQA) systems are challenging, and
recent research has aimed at investigating this important field. Many issues related to real …

Cited by 8 - Related articles - All 2 versions

Explicit cross-modal representation learning for visual commonsense reasoning

X Zhang, F Zhang, C Xu - IEEE Transactions on Multimedia, 2021 - ieeexplore.ieee.org

Given a question about an image, Visual Commonsense Reasoning (VCR) needs to provide
not only a correct answer, but also a rationale to justify the answer. VCR is a challenging …

Cited by 29 - Related articles - All 2 versions

Positional attention guided transformer-like architecture for visual question answering

A Mao, Z Yang, K Lin, J Xuan… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org

Transformer architectures have recently been introduced into the field of visual question
answering (VQA), due to their powerful capabilities of information extraction and fusion …

Cited by 18 - Related articles - All 2 versions
