A survey of transformers

T Lin, Y Wang, X Liu, X Qiu - AI Open, 2022 - Elsevier
Transformers have achieved great success in many artificial intelligence fields, such as
natural language processing, computer vision, and audio processing. Therefore, it is natural …

Transformer: A general framework from machine translation to others

Y Zhao, J Zhang, C Zong - Machine Intelligence Research, 2023 - Springer
Machine translation is an important and challenging task that aims at automatically
translating natural language sentences from one language into another. Recently …

GIT: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

FLAVA: A foundational language and vision alignment model

A Singh, R Hu, V Goswami… - Proceedings of the …, 2022 - openaccess.thecvf.com
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …

UniT: Multimodal multitask learning with a unified transformer

R Hu, A Singh - Proceedings of the IEEE/CVF international …, 2021 - openaccess.thecvf.com
We propose UniT, a Unified Transformer model to simultaneously learn the most
prominent tasks across different domains, ranging from object detection to natural language …

DocVQA: A dataset for VQA on document images

M Mathew, D Karatzas… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We present a new dataset for Visual Question Answering (VQA) on document images called
DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …

MulT: An end-to-end multitask learning transformer

D Bhattacharjee, T Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to
simultaneously learn multiple high-level vision tasks, including depth estimation, semantic …

Align and attend: Multimodal summarization with dual contrastive losses

B He, J Wang, J Qiu, T Bui… - Proceedings of the …, 2023 - openaccess.thecvf.com
The goal of multimodal summarization is to extract the most important information from
different modalities to form summaries. Unlike unimodal summarization, the multimodal …

TextCaps: A dataset for image captioning with reading comprehension

O Sidorov, R Hu, M Rohrbach, A Singh - Computer Vision – ECCV 2020, August 23–28, 2020, Proceedings, Part II, 2020 - Springer
Image descriptions can help visually impaired people to quickly understand the image
content. While we made significant progress in automatically describing images and optical …

UniTAB: Unifying text and box outputs for grounded vision-language modeling

Z Yang, Z Gan, J Wang, X Hu, F Ahmed, Z Liu… - European Conference on Computer Vision, 2022 - Springer
We propose UniTAB that Unifies Text And Box outputs for grounded vision-language (VL)
modeling. Grounded VL tasks such as grounded captioning require the …