Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

A Suglia, I Konstas, O Lemon - Journal of Artificial Intelligence Research, 2024 - jair.org
In recent years, several machine learning models have been proposed that are trained
with a language modelling objective on large-scale text-only data. With such pretraining …

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

J Hessel, JD Hwang, JS Park, R Zellers… - … on Computer Vision, 2022 - Springer
Humans have remarkable capacity to reason abductively and hypothesize about what lies
beyond the literal content of an image. By identifying concrete visual clues scattered …

Dealing with semantic underspecification in multimodal NLP

S Pezzelle - arXiv preprint arXiv:2306.05240, 2023 - arxiv.org
Intelligent systems that aim at mastering language as humans do must deal with its semantic
underspecification, namely, the possibility for a linguistic signal to convey only part of the …

Linguistic issues behind visual question answering

R Bernardi, S Pezzelle - Language and Linguistics Compass, 2021 - Wiley Online Library
Answering a question that is grounded in an image is a crucial ability that requires
understanding the question, the visual context, and their interaction at many linguistic levels …

GeoGLUE: A geographic language understanding evaluation benchmark

D Li, R Ding, Q Zhang, Z Li, B Chen, P Xie, Y Xu… - arXiv preprint arXiv …, 2023 - arxiv.org
With the fast-developing pace of geographic applications, automated and intelligent models
must be designed to handle the large volume of information. However, few …

CELLO: Causal Evaluation of Large Vision-Language Models

M Chen, B Peng, Y Zhang, C Lu - arXiv preprint arXiv:2406.19131, 2024 - arxiv.org
Causal reasoning is fundamental to human intelligence and crucial for effective decision-
making in real-world environments. Despite recent advancements in large vision-language …

WhyAct: Identifying action reasons in lifestyle vlogs

O Ignat, S Castro, H Miao, W Li, R Mihalcea - arXiv preprint arXiv …, 2021 - arxiv.org
We aim to automatically identify human action reasons in online videos. We focus on the
widespread genre of lifestyle vlogs, in which people perform actions while verbally …

What Vision-Language Models 'See' when they See Scenes

M Cafagna, K van Deemter, A Gatt - arXiv preprint arXiv:2109.07301, 2021 - arxiv.org
Images can be described in terms of the objects they contain, or in terms of the types of
scene or place that they instantiate. In this paper we address to what extent pretrained …

HL dataset: visually-grounded description of scenes, actions and rationales

M Cafagna, K van Deemter, A Gatt - arXiv preprint arXiv:2302.12189, 2023 - arxiv.org
Current captioning datasets focus on object-centric captions, describing the visible objects in
the image, e.g. "people eating food in a park". Although these datasets are useful to evaluate …

One Picture and a Thousand Words: Generative Language+Images Models and How to Train Them

R Zamparelli - CEUR WORKSHOP PROCEEDINGS, 2023 - iris.unitn.it
Thanks to independent advances in language and image generation we could soon be in
the position to have systems that communicate with humans by combining language and …