Cascade reasoning network for text-based visual question answering

GIT: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

Cited by 426 Related articles All 4 versions

LaTr: Layout-aware transformer for scene-text VQA

AF Biten, R Litman, Y Xie… - Proceedings of the …, 2022 - openaccess.thecvf.com

We propose a novel multimodal architecture for Scene Text Visual Question Answering
(STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to …

Cited by 86 Related articles All 7 versions

TAP: Text-aware pre-training for Text-VQA and Text-Caption

Z Yang, Y Lu, J Wang, X Yin… - Proceedings of the …, 2021 - openaccess.thecvf.com

In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption
tasks. These two tasks aim at reading and understanding scene text in images for question …

Cited by 155 Related articles All 8 versions

Dual-path rare content enhancement network for image and text matching

Y Wang, Y Su, W Li, J Xiao, X Li… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Image and text matching plays a crucial role in bridging the cross-modal gap between vision
and language, and has achieved great progress due to the deep learning. However, the …

Cited by 32 Related articles All 2 versions

Weakly-supervised 3D spatial reasoning for text-based visual question answering

H Li, J Huang, P Jin, G Song, Q Wu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Text-based Visual Question Answering (TextVQA) aims to produce correct answers for given
questions about the images with multiple scene texts. In most cases, the texts naturally …

Cited by 17 Related articles All 5 versions

Multi-attention network for compressed video referring object segmentation

W Chen, D Hong, Y Qi, Z Han, S Wang, L Qing… - Proceedings of the 30th …, 2022 - dl.acm.org

Referring video object segmentation aims to segment the object referred by a given
language expression. Existing works typically require compressed video bitstream to be …

Cited by 30 Related articles All 4 versions

Towards video text visual question answering: Benchmark and baseline

M Zhao, B Li, J Wang, W Li, W Zhou… - Advances in …, 2022 - proceedings.neurips.cc

There are already some text-based visual question answering (TextVQA) benchmarks for
developing machine's ability to answer questions based on texts in images in recent years …

Cited by 18 Related articles All 7 versions

A survey of methods, datasets and evaluation metrics for visual question answering

H Sharma, AS Jalal - Image and Vision Computing, 2021 - Elsevier

Visual Question Answering (VQA) is a multi-disciplinary research problem that has
captured the attention of both computer vision as well as natural language processing …

Cited by 39 Related articles All 2 versions

Beyond OCR + VQA: involving OCR into the flow for robust and accurate TextVQA

G Zeng, Y Zhang, Y Zhou, X Yang - Proceedings of the 29th ACM …, 2021 - dl.acm.org

Text-based visual question answering (TextVQA) requires analyzing both the visual contents
and texts in an image to answer a question, which is more practical than general visual …

Cited by 38 Related articles All 2 versions

Exploring sparse spatial relation in graph inference for text-based VQA

S Zhou, D Guo, J Li, X Yang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Text-based visual question answering (TextVQA) faces the significant challenge of avoiding
redundant relational inference. To be specific, a large number of detected objects and …

Cited by 7 Related articles All 6 versions
