A corpus for reasoning about natural language grounded in photographs

A Suhr, S Zhou, A Zhang, I Zhang, H Bai… - arXiv preprint arXiv …, 2018 - arxiv.org
We introduce a new dataset for joint reasoning about natural language and images, with a
focus on semantic diversity, compositionality, and visual reasoning challenges. The data …

A corpus of natural language for visual reasoning

A Suhr, M Lewis, J Yeh, Y Artzi - … of the 55th Annual Meeting of the …, 2017 - aclanthology.org
We present a new visual reasoning language dataset, containing 92,244 pairs of examples
of natural statements grounded in synthetic images with 3,962 unique sentences. We …

Natural language rationales with full-stack visual reasoning: From pixels to semantic frames to commonsense graphs

A Marasović, C Bhagavatula, JS Park, RL Bras… - arXiv preprint arXiv …, 2020 - arxiv.org
Natural language rationales could provide intuitive, higher-level explanations that are easily
understandable by humans, complementing the more broadly studied lower-level …

Multimodal graph networks for compositional generalization in visual question answering

R Saqur, K Narasimhan - Advances in Neural Information …, 2020 - proceedings.neurips.cc
Compositional generalization is a key challenge in grounding natural language to visual
perception. While deep learning models have achieved great success in multimodal tasks …

Why is Winoground hard? Investigating failures in visuolinguistic compositionality

A Diwan, L Berry, E Choi, D Harwath… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent visuolinguistic pre-trained models show promising progress on various end tasks
such as image retrieval and video captioning. Yet, they fail miserably on the recently …

Visual entailment: A novel task for fine-grained image understanding

N Xie, F Lai, D Doran, A Kadav - arXiv preprint arXiv:1901.06706, 2019 - arxiv.org
Existing visual reasoning datasets such as Visual Question Answering (VQA) often suffer
from biases conditioned on the question, image or answer distributions. The recently …

Learning to generalize to new compositions in image understanding

Y Atzmon, J Berant, V Kezami, A Globerson… - arXiv preprint arXiv …, 2016 - arxiv.org
Recurrent neural networks have recently been used for learning to describe images using
natural language. However, it has been observed that these models generalize poorly to …

Winoground: Probing vision and language models for visio-linguistic compositionality

T Thrush, R Jiang, M Bartolo, A Singh… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present a novel task and dataset for evaluating the ability of vision and language models
to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two …

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

S Gu, C Clark, A Kembhavi - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Many high-level skills that are required for computer vision tasks, such as parsing questions,
comparing and contrasting semantics, and writing descriptions, are also required in other …

NLX-GPT: A model for natural language explanations in vision and vision-language tasks

F Sammani, T Mukherjee… - Proceedings of the …, 2022 - openaccess.thecvf.com
Natural language explanation (NLE) models aim at explaining the decision-making process
of a black box system via generating natural language sentences which are human-friendly …