Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

A Suglia, I Konstas, O Lemon - Journal of Artificial Intelligence Research, 2024 - jair.org
In recent years, several machine learning models have been proposed that are trained
with a language modelling objective on large-scale text-only data. With such pretraining …

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

J Hessel, JD Hwang, JS Park, R Zellers… - … on Computer Vision, 2022 - Springer
Humans have remarkable capacity to reason abductively and hypothesize about what lies
beyond the literal content of an image. By identifying concrete visual clues scattered …

Dealing with semantic underspecification in multimodal NLP

S Pezzelle - arXiv preprint arXiv:2306.05240, 2023 - arxiv.org
Intelligent systems that aim at mastering language as humans do must deal with its semantic
underspecification, namely, the possibility for a linguistic signal to convey only part of the …

Linguistic issues behind visual question answering

R Bernardi, S Pezzelle - Language and Linguistics Compass, 2021 - Wiley Online Library
Answering a question that is grounded in an image is a crucial ability that requires
understanding the question, the visual context, and their interaction at many linguistic levels …

GeoGLUE: A geographic language understanding evaluation benchmark

D Li, R Ding, Q Zhang, Z Li, B Chen, P Xie, Y Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
With the fast-developing pace of geographic applications, automated and intelligent models
must be designed to handle the large volume of information. However, few …

CELLO: Causal Evaluation of Large Vision-Language Models

M Chen, B Peng, Y Zhang, C Lu - arXiv preprint arXiv:2406.19131, 2024 - arxiv.org
Causal reasoning is fundamental to human intelligence and crucial for effective decision-
making in real-world environments. Despite recent advancements in large vision-language …

WhyAct: Identifying action reasons in lifestyle vlogs

O Ignat, S Castro, H Miao, W Li, R Mihalcea - arXiv preprint arXiv …, 2021 - arxiv.org
We aim to automatically identify human action reasons in online videos. We focus on the
widespread genre of lifestyle vlogs, in which people perform actions while verbally …

What Vision-Language Models 'See' when they See Scenes

M Cafagna, K van Deemter, A Gatt - arXiv preprint arXiv:2109.07301, 2021 - arxiv.org
Images can be described in terms of the objects they contain, or in terms of the types of
scene or place that they instantiate. In this paper we address to what extent pretrained …

HL dataset: visually-grounded description of scenes, actions and rationales

M Cafagna, K van Deemter, A Gatt - arXiv preprint arXiv:2302.12189, 2023 - arxiv.org
Current captioning datasets focus on object-centric captions, describing the visible objects in
the image, e.g. "people eating food in a park". Although these datasets are useful to evaluate …

One Picture and a Thousand Words: Generative Language+Images Models and How to Train Them

R Zamparelli - CEUR WORKSHOP PROCEEDINGS, 2023 - iris.unitn.it
Thanks to independent advances in language and image generation we could soon be in
the position to have systems that communicate with humans by combining language and …