Automatic chart understanding: a review

AM Farahani, P Adibi, MS Ehsani, HP Hutter… - IEEE …, 2023 - ieeexplore.ieee.org
Automated chart analysis has vast potential to improve the accessibility of charts for a wider
audience, e.g., people with visual impairments or other disabilities, by generating captions for …

HAIR: Hierarchical visual-semantic relational reasoning for video question answering

F Liu, J Liu, W Wang, H Lu - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
Relational reasoning is at the heart of video question answering. However, existing
approaches suffer from several common limitations: (1) they only focus on either object-level …

Test-time model adaptation for visual question answering with debiased self-supervisions

Z Wen, S Niu, G Li, Q Wu, M Tan… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Visual question answering (VQA) is a prevalent task in the real world, and plays an essential
role in helping the blind understand the physical world. However, due to the real-world …

VLAB: Enhancing video language pre-training by feature adapting and blending

X He, S Chen, F Ma, Z Huang, X Jin… - IEEE Transactions …, 2024 - ieeexplore.ieee.org
Large-scale image-text contrastive pre-training models, such as CLIP, have been
demonstrated to effectively learn high-quality multimodal representations. However, there is …

Encoder–decoder cycle for visual question answering based on perception-action cycle

SAM Mohamud, A Jalali, M Lee - Pattern Recognition, 2023 - Elsevier
In this study, we propose a novel encoder–decoder cycle (EDC) framework inspired by the
human learning process called the perception-action cycle to tackle challenging problems …

Causal inference with knowledge distilling and curriculum learning for unbiased VQA

Y Pan, Z Li, L Zhang, J Tang - ACM Transactions on Multimedia …, 2022 - dl.acm.org
Recently, many Visual Question Answering (VQA) models rely on the correlations between
questions and answers yet neglect those between the visual information and the textual …

HGAN: Hierarchical graph alignment network for image-text retrieval

J Guo, M Wang, Y Zhou, B Song, Y Chi… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Image-text retrieval (ITR) is a challenging task in the field of multimodal information
processing due to the semantic gap between different modalities. In recent years …

Resolving zero-shot and fact-based visual question answering via enhanced fact retrieval

S Wu, G Zhao, X Qian - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org
Practical applications with visual question answering (VQA) systems are challenging, and
recent research has aimed at investigating this important field. Many issues related to real …

Explicit cross-modal representation learning for visual commonsense reasoning

X Zhang, F Zhang, C Xu - IEEE Transactions on Multimedia, 2021 - ieeexplore.ieee.org
Given a question about an image, Visual Commonsense Reasoning (VCR) needs to provide
not only a correct answer, but also a rationale to justify the answer. VCR is a challenging …

Positional attention guided transformer-like architecture for visual question answering

A Mao, Z Yang, K Lin, J Xuan… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Transformer architectures have recently been introduced into the field of visual question
answering (VQA), due to their powerful capabilities of information extraction and fusion …