GIT: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

Keeping your eye on the ball: Trajectory attention in video transformers

M Patrick, D Campbell, Y Asano… - Advances in neural …, 2021 - proceedings.neurips.cc
In video transformers, the time dimension is often treated in the same way as the two spatial
dimensions. However, in a scene where objects or the camera may move, a physical point …

LaTr: Layout-aware transformer for scene-text VQA

AF Biten, R Litman, Y Xie… - Proceedings of the …, 2022 - openaccess.thecvf.com
We propose a novel multimodal architecture for Scene Text Visual Question Answering
(STVQA), named Layout-Aware Transformer (LaTr). The task of STVQA requires models to …

TAP: Text-aware pre-training for Text-VQA and Text-Caption

Z Yang, Y Lu, J Wang, X Yin… - Proceedings of the …, 2021 - openaccess.thecvf.com
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption
tasks. These two tasks aim at reading and understanding scene text in images for question …

TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text

A Singh, G Pang, M Toh, J Huang… - Proceedings of the …, 2021 - openaccess.thecvf.com
A crucial component of the scene-text-based reasoning required for the TextVQA and TextCaps
datasets involves detecting and recognizing text present in the images using an optical …

Latent variable sequential set transformers for joint multi-agent motion prediction

R Girgis, F Golemo, F Codevilla, M Weiss… - arXiv preprint arXiv …, 2021 - arxiv.org
Robust multi-agent trajectory prediction is essential for the safe control of robotic systems. A
major challenge is to efficiently learn a representation that approximates the true joint …

DocFormerv2: Local features for document understanding

S Appalaraju, P Tang, Q Dong, N Sankaran… - Proceedings of the …, 2024 - ojs.aaai.org
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding
(VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) …

ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration

Y Cui, Z Yu, C Wang, Z Zhao, J Zhang… - Proceedings of the 29th …, 2021 - dl.acm.org
Vision-and-language pretraining (VLP) aims to learn generic multimodal representations
from massive image-text pairs. While various successful approaches have been proposed …

Weakly-supervised 3D spatial reasoning for text-based visual question answering

H Li, J Huang, P Jin, G Song, Q Wu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Text-based Visual Question Answering (TextVQA) aims to produce correct answers to
questions about images containing multiple scene texts. In most cases, the texts naturally …

Multimodal learning using optimal transport for sarcasm and humor detection

S Pramanick, A Roy, VM Patel - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Multimodal learning is an emerging yet challenging research area. In this paper, we deal
with multimodal sarcasm and humor detection from conversational videos and image-text …