Related articles - Academic Resource Search

MERLOT: Multimodal neural script knowledge models

R Zellers, X Lu, J Hessel, Y Yu… - Advances in neural …, 2021 - proceedings.neurips.cc

As humans, we understand events in the visual world contextually, performing multimodal
reasoning across time to make inferences about the past, present, and future. We introduce …

Cited by 348 | Related articles | All 7 versions

[PDF] thecvf.com

MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com

As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

Cited by 213 | Related articles | All 9 versions

[PDF] arxiv.org

Visualcomet: Reasoning about the dynamic context of a still image

JS Park, C Bhagavatula, R Mottaghi, A Farhadi… - Computer Vision–ECCV …, 2020 - Springer

Even from a single frame of a still image, people can reason about the dynamic story of the
image before, after, and beyond the frame. For example, given an image of a man struggling …

Cited by 120 | Related articles | All 5 versions

[PDF] thecvf.com

From recognition to cognition: Visual commonsense reasoning

R Zellers, Y Bisk, A Farhadi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com

Visual understanding goes well beyond object recognition. With one glance at an image, we
can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …

Cited by 872 | Related articles | All 7 versions

[PDF] thecvf.com

Revisiting the "video" in video-language understanding

S Buch, C Eyzaguirre, A Gaidon, J Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com

What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …

Cited by 120 | Related articles | All 6 versions

[PDF] neurips.cc

Heterogeneous graph learning for visual commonsense reasoning

W Yu, J Zhou, W Yu, X Liang… - Advances in Neural …, 2019 - proceedings.neurips.cc

Visual commonsense reasoning task aims at leading the research field into solving
cognition-level reasoning with the ability to predict correct answers and meanwhile …

Cited by 51 | Related articles | All 9 versions

[PDF] arxiv.org

Broaden the vision: Geo-diverse visual commonsense reasoning

D Yin, LH Li, Z Hu, N Peng, KW Chang - arXiv preprint arXiv:2109.06860, 2021 - arxiv.org

Commonsense is defined as the knowledge that is shared by everyone. However, certain
types of commonsense knowledge are correlated with culture and geographic locations and …

Cited by 37 | Related articles | All 5 versions

[PDF] aaai.org

A case study of the shortcut effects in visual commonsense reasoning

K Ye, A Kovashka - Proceedings of the AAAI conference on artificial …, 2021 - ojs.aaai.org

Visual reasoning and question-answering have gathered attention in recent years. Many
datasets and evaluation protocols have been proposed; some have been shown to contain …

Cited by 43 | Related articles | All 6 versions

[PDF] arxiv.org

Natural language rationales with full-stack visual reasoning: From pixels to semantic frames to commonsense graphs

A Marasović, C Bhagavatula, JS Park, RL Bras… - arXiv preprint arXiv …, 2020 - arxiv.org

Natural language rationales could provide intuitive, higher-level explanations that are easily
understandable by humans, complementing the more broadly studied lower-level …

Cited by 55 | Related articles | All 3 versions

[PDF] arxiv.org

Video-LLaMA: An instruction-tuned audio-visual language model for video understanding

H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models
(LLMs) with the capability of understanding both visual and auditory content in the video …

Cited by 351 | Related articles | All 3 versions
