Abstract: Machine translation is an important and challenging task that aims at automatically translating natural language sentences from one language into another. Recently …
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While …
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Generally …
R Hu, A Singh - Proceedings of the IEEE/CVF international …, 2021 - openaccess.thecvf.com
Abstract: We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language …
M Mathew, D Karatzas… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks, including depth estimation, semantic …
B He, J Wang, J Qiu, T Bui… - Proceedings of the …, 2023 - openaccess.thecvf.com
The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, the multimodal …
Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical …
Abstract: We propose UniTAB, which Unifies Text And Box outputs for grounded vision-language (VL) modeling. Grounded VL tasks such as grounded captioning require the …