Related articles - Academic Resource Search

A corpus for reasoning about natural language grounded in photographs

A Suhr, S Zhou, A Zhang, I Zhang, H Bai… - arXiv preprint arXiv …, 2018 - arxiv.org

We introduce a new dataset for joint reasoning about natural language and images, with a
focus on semantic diversity, compositionality, and visual reasoning challenges. The data …

Cited by 524 · Related articles · All 8 versions

[PDF] aclanthology.org

A corpus of natural language for visual reasoning

A Suhr, M Lewis, J Yeh, Y Artzi - … of the 55th Annual Meeting of the …, 2017 - aclanthology.org

We present a new visual reasoning language dataset, containing 92,244 pairs of examples
of natural statements grounded in synthetic images with 3,962 unique sentences. We …

Cited by 260 · Related articles · All 3 versions

[PDF] arxiv.org

Natural language rationales with full-stack visual reasoning: From pixels to semantic frames to commonsense graphs

A Marasović, C Bhagavatula, JS Park, RL Bras… - arXiv preprint arXiv …, 2020 - arxiv.org

Natural language rationales could provide intuitive, higher-level explanations that are easily
understandable by humans, complementing the more broadly studied lower-level …

Cited by 55 · Related articles · All 3 versions

[PDF] neurips.cc

Multimodal graph networks for compositional generalization in visual question answering

R Saqur, K Narasimhan - Advances in Neural Information …, 2020 - proceedings.neurips.cc

Compositional generalization is a key challenge in grounding natural language to visual
perception. While deep learning models have achieved great success in multimodal tasks …

Cited by 61 · Related articles · All 6 versions

[PDF] arxiv.org

Why is Winoground hard? Investigating failures in visuolinguistic compositionality

A Diwan, L Berry, E Choi, D Harwath… - arXiv preprint arXiv …, 2022 - arxiv.org

Recent visuolinguistic pre-trained models show promising progress on various end tasks
such as image retrieval and video captioning. Yet, they fail miserably on the recently …

Cited by 39 · Related articles · All 4 versions

[PDF] arxiv.org

Visual entailment: A novel task for fine-grained image understanding

N Xie, F Lai, D Doran, A Kadav - arXiv preprint arXiv:1901.06706, 2019 - arxiv.org

Existing visual reasoning datasets such as Visual Question Answering (VQA), often suffer
from biases conditioned on the question, image or answer distributions. The recently …

Cited by 289 · Related articles · All 4 versions

[PDF] arxiv.org

Learning to generalize to new compositions in image understanding

Y Atzmon, J Berant, V Kezami, A Globerson… - arXiv preprint arXiv …, 2016 - arxiv.org

Recurrent neural networks have recently been used for learning to describe images using
natural language. However, it has been observed that these models generalize poorly to …

Cited by 74 · Related articles · All 3 versions

[PDF] thecvf.com

Winoground: Probing vision and language models for visio-linguistic compositionality

T Thrush, R Jiang, M Bartolo, A Singh… - Proceedings of the …, 2022 - openaccess.thecvf.com

We present a novel task and dataset for evaluating the ability of vision and language models
to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two …

Cited by 267 · Related articles · All 6 versions

[PDF] thecvf.com

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision

S Gu, C Clark, A Kembhavi - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

Many high-level skills that are required for computer vision tasks, such as parsing questions,
comparing and contrasting semantics, and writing descriptions, are also required in other …

Cited by 10 · Related articles · All 3 versions

[PDF] thecvf.com

NLX-GPT: A model for natural language explanations in vision and vision-language tasks

F Sammani, T Mukherjee… - Proceedings of the …, 2022 - openaccess.thecvf.com

Natural language explanation (NLE) models aim at explaining the decision-making process
of a black box system via generating natural language sentences which are human-friendly …

Cited by 51 · Related articles · All 8 versions
