Spatially aware multimodal transformers for TextVQA

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

Cited by 409 · Related articles · All 4 versions

Keeping your eye on the ball: Trajectory attention in video transformers

M Patrick, D Campbell, Y Asano… - Advances in neural …, 2021 - proceedings.neurips.cc

In video transformers, the time dimension is often treated in the same way as the two spatial
dimensions. However, in a scene where objects or the camera may move, a physical point …

Cited by 241 · Related articles · All 13 versions

LaTr: Layout-aware transformer for scene-text VQA

AF Biten, R Litman, Y Xie… - Proceedings of the …, 2022 - openaccess.thecvf.com

We propose a novel multimodal architecture for Scene Text Visual Question Answering
(STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to …

Cited by 86 · Related articles · All 7 versions

TAP: Text-aware pre-training for Text-VQA and Text-Caption

Z Yang, Y Lu, J Wang, X Yin… - Proceedings of the …, 2021 - openaccess.thecvf.com

In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption
tasks. These two tasks aim at reading and understanding scene text in images for question …

Cited by 153 · Related articles · All 8 versions

TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

A Singh, G Pang, M Toh, J Huang… - Proceedings of the …, 2021 - openaccess.thecvf.com

A crucial component for the scene text based reasoning required for TextVQA and TextCaps
datasets involve detecting and recognizing text present in the images using an optical …

Cited by 127 · Related articles · All 7 versions

Latent variable sequential set transformers for joint multi-agent motion prediction

R Girgis, F Golemo, F Codevilla, M Weiss… - arXiv preprint arXiv …, 2021 - arxiv.org

Robust multi-agent trajectory prediction is essential for the safe control of robotic systems. A
major challenge is to efficiently learn a representation that approximates the true joint …

Cited by 98 · Related articles · All 6 versions

DocFormerv2: Local features for document understanding

S Appalaraju, P Tang, Q Dong, N Sankaran… - Proceedings of the …, 2024 - ojs.aaai.org

We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding
(VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) …

Cited by 21 · Related articles · All 4 versions

ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration

Y Cui, Z Yu, C Wang, Z Zhao, J Zhang… - Proceedings of the 29th …, 2021 - dl.acm.org

Vision-and-language pretraining (VLP) aims to learn generic multimodal representations
from massive image-text pairs. While various successful attempts have been proposed …

Cited by 56 · Related articles · All 4 versions

Weakly-supervised 3D spatial reasoning for text-based visual question answering

H Li, J Huang, P Jin, G Song, Q Wu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Text-based Visual Question Answering (TextVQA) aims to produce correct answers for given
questions about the images with multiple scene texts. In most cases, the texts naturally …

Cited by 17 · Related articles · All 5 versions

Multimodal learning using optimal transport for sarcasm and humor detection

S Pramanick, A Roy, VM Patel - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

Multimodal learning is an emerging yet challenging research area. In this paper, we deal
with multimodal sarcasm and humor detection from conversational videos and image-text …

Cited by 42 · Related articles · All 6 versions
