A-OKVQA: A benchmark for visual question answering using world knowledge

D Schwenk, A Khandelwal, C Clark, K Marino… - European conference on …, 2022 - Springer
The Visual Question Answering (VQA) task aspires to provide a meaningful testbed
for the development of AI models that can jointly reason over visual and natural language …

FLAVA: A foundational language and vision alignment model

A Singh, R Hu, V Goswami… - Proceedings of the …, 2022 - openaccess.thecvf.com
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …

How much can CLIP benefit vision-and-language tasks?

S Shen, LH Li, H Tan, M Bansal, A Rohrbach… - arXiv preprint arXiv …, 2021 - arxiv.org
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …

Seeing out of the box: End-to-end pre-training for vision-language representation learning

Z Huang, Z Zeng, Y Huang, B Liu… - Proceedings of the …, 2021 - openaccess.thecvf.com
We study the joint learning of a Convolutional Neural Network (CNN) and a Transformer for
vision-language pre-training (VLPT), which aims to learn cross-modal alignments from …

LXMERT: Learning cross-modality encoder representations from transformers

H Tan, M Bansal - arXiv preprint arXiv:1908.07490, 2019 - arxiv.org
Vision-and-language reasoning requires an understanding of visual concepts, language
semantics, and, most importantly, the alignment and relationships between these two …

The hateful memes challenge: Detecting hate speech in multimodal memes

D Kiela, H Firooz, A Mohan… - Advances in neural …, 2020 - proceedings.neurips.cc
This work proposes a new challenge set for multimodal classification, focusing on detecting
hate speech in multimodal memes. It is constructed such that unimodal models struggle and …

Unified vision-language pre-training for image captioning and VQA

L Zhou, H Palangi, L Zhang, H Hu, J Corso… - Proceedings of the AAAI …, 2020 - ojs.aaai.org
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is
unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image …

Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training

G Li, N Duan, Y Fang, M Gong, D Jiang - Proceedings of the AAAI …, 2020 - aaai.org
We propose Unicoder-VL, a universal encoder that aims to learn joint representations of
vision and language in a pre-training manner. Borrowing ideas from cross-lingual pre-trained …

DocVQA: A dataset for VQA on document images

M Mathew, D Karatzas… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We present a new dataset for Visual Question Answering (VQA) on document images called
DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images …

Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers

Z Huang, Z Zeng, B Liu, D Fu, J Fu - arXiv preprint arXiv:2004.00849, 2020 - arxiv.org
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers
that jointly learn visual and language embedding in a unified end-to-end framework. We aim …