MERLOT: Multimodal neural script knowledge models

R Zellers, X Lu, J Hessel, Y Yu… - Advances in neural …, 2021 - proceedings.neurips.cc
As humans, we understand events in the visual world contextually, performing multimodal
reasoning across time to make inferences about the past, present, and future. We introduce …

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

VisualCOMET: Reasoning about the dynamic context of a still image

JS Park, C Bhagavatula, R Mottaghi, A Farhadi… - Computer Vision–ECCV …, 2020 - Springer
Even from a single frame of a still image, people can reason about the dynamic story of the
image before, after, and beyond the frame. For example, given an image of a man struggling …

From recognition to cognition: Visual commonsense reasoning

R Zellers, Y Bisk, A Farhadi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Visual understanding goes well beyond object recognition. With one glance at an image, we
can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …

Revisiting the "video" in video-language understanding

S Buch, C Eyzaguirre, A Gaidon, J Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com
What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …

Heterogeneous graph learning for visual commonsense reasoning

W Yu, J Zhou, W Yu, X Liang… - Advances in Neural …, 2019 - proceedings.neurips.cc
Visual commonsense reasoning task aims at leading the research field into solving
cognition-level reasoning with the ability to predict correct answers and meanwhile …

Broaden the vision: Geo-diverse visual commonsense reasoning

D Yin, LH Li, Z Hu, N Peng, KW Chang - arXiv preprint arXiv:2109.06860, 2021 - arxiv.org
Commonsense is defined as the knowledge that is shared by everyone. However, certain
types of commonsense knowledge are correlated with culture and geographic locations and …

A case study of the shortcut effects in visual commonsense reasoning

K Ye, A Kovashka - Proceedings of the AAAI conference on artificial …, 2021 - ojs.aaai.org
Visual reasoning and question-answering have gathered attention in recent years. Many
datasets and evaluation protocols have been proposed; some have been shown to contain …

Natural language rationales with full-stack visual reasoning: From pixels to semantic frames to commonsense graphs

A Marasović, C Bhagavatula, JS Park, RL Bras… - arXiv preprint arXiv …, 2020 - arxiv.org
Natural language rationales could provide intuitive, higher-level explanations that are easily
understandable by humans, complementing the more broadly studied lower-level …

Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models
(LLMs) with the capability of understanding both visual and auditory content in the video …