A survey of transformers

T Lin, Y Wang, X Liu, X Qiu - AI Open, 2022 - Elsevier
Transformers have achieved great success in many artificial intelligence fields, such as
natural language processing, computer vision, and audio processing. Therefore, it is natural …

Transformer: A general framework from machine translation to others

Y Zhao, J Zhang, C Zong - Machine Intelligence Research, 2023 - Springer
Machine translation is an important and challenging task that aims at automatically
translating natural language sentences from one language into another. Recently …

GIT: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

FLAVA: A foundational language and vision alignment model

A Singh, R Hu, V Goswami… - Proceedings of the …, 2022 - openaccess.thecvf.com
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …

UniT: Multimodal multitask learning with a unified transformer

R Hu, A Singh - Proceedings of the IEEE/CVF international …, 2021 - openaccess.thecvf.com
We propose UniT, a Unified Transformer model to simultaneously learn the most
prominent tasks across different domains, ranging from object detection to natural language …

DocVQA: A dataset for VQA on document images

M Mathew, D Karatzas… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We present a new dataset for Visual Question Answering (VQA) on document images called
DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …

MulT: An end-to-end multitask learning transformer

D Bhattacharjee, T Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to
simultaneously learn multiple high-level vision tasks, including depth estimation, semantic …

Align and attend: Multimodal summarization with dual contrastive losses

B He, J Wang, J Qiu, T Bui… - Proceedings of the …, 2023 - openaccess.thecvf.com
The goal of multimodal summarization is to extract the most important information from
different modalities to form summaries. Unlike unimodal summarization, the multimodal …

TextCaps: A dataset for image captioning with reading comprehension

O Sidorov, R Hu, M Rohrbach, A Singh - Computer Vision – ECCV 2020, August 23–28, 2020, Proceedings, Part II, 2020 - Springer
Image descriptions can help visually impaired people to quickly understand the image
content. While we made significant progress in automatically describing images and optical …

UniTAB: Unifying text and box outputs for grounded vision-language modeling

Z Yang, Z Gan, J Wang, X Hu, F Ahmed, Z Liu… - European Conference on Computer Vision, 2022 - Springer
We propose UniTAB that Unifies Text And Box outputs for grounded vision-language (VL)
modeling. Grounded VL tasks such as grounded captioning require the …