GIT: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

LaTr: Layout-aware transformer for scene-text VQA

AF Biten, R Litman, Y Xie… - Proceedings of the …, 2022 - openaccess.thecvf.com
We propose a novel multimodal architecture for Scene Text Visual Question Answering
(STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to …

TAP: Text-aware pre-training for Text-VQA and Text-Caption

Z Yang, Y Lu, J Wang, X Yin… - Proceedings of the …, 2021 - openaccess.thecvf.com
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption
tasks. These two tasks aim at reading and understanding scene text in images for question …

Dual-path rare content enhancement network for image and text matching

Y Wang, Y Su, W Li, J Xiao, X Li… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Image and text matching plays a crucial role in bridging the cross-modal gap between vision
and language, and has achieved great progress due to the deep learning. However, the …

Weakly-supervised 3D spatial reasoning for text-based visual question answering

H Li, J Huang, P Jin, G Song, Q Wu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Text-based Visual Question Answering (TextVQA) aims to produce correct answers for given
questions about the images with multiple scene texts. In most cases, the texts naturally …

Multi-attention network for compressed video referring object segmentation

W Chen, D Hong, Y Qi, Z Han, S Wang, L Qing… - Proceedings of the 30th …, 2022 - dl.acm.org
Referring video object segmentation aims to segment the object referred by a given
language expression. Existing works typically require compressed video bitstream to be …

Towards video text visual question answering: Benchmark and baseline

M Zhao, B Li, J Wang, W Li, W Zhou… - Advances in …, 2022 - proceedings.neurips.cc
There are already some text-based visual question answering (TextVQA) benchmarks for
developing machine's ability to answer questions based on texts in images in recent years …

A survey of methods, datasets and evaluation metrics for visual question answering

H Sharma, AS Jalal - Image and Vision Computing, 2021 - Elsevier
Visual Question Answering (VQA) is a multi-disciplinary research problem that has
captured the attention of both computer vision as well as natural language processing …

Beyond OCR + VQA: Involving OCR into the flow for robust and accurate TextVQA

G Zeng, Y Zhang, Y Zhou, X Yang - Proceedings of the 29th ACM …, 2021 - dl.acm.org
Text-based visual question answering (TextVQA) requires analyzing both the visual contents
and texts in an image to answer a question, which is more practical than general visual …

Exploring sparse spatial relation in graph inference for text-based VQA

S Zhou, D Guo, J Li, X Yang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Text-based visual question answering (TextVQA) faces the significant challenge of avoiding
redundant relational inference. To be specific, a large number of detected objects and …