MERLOT Reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

MERLOT: Multimodal neural script knowledge models

R Zellers, X Lu, J Hessel, Y Yu… - Advances in …, 2021 - proceedings.neurips.cc
As humans, we understand events in the visual world contextually, performing multimodal
reasoning across time to make inferences about the past, present, and future. We introduce …

VIOLIN: A large-scale dataset for video-and-language inference

J Liu, W Chen, Y Cheng, Z Gan, L Yu… - Proceedings of the …, 2020 - openaccess.thecvf.com
We introduce a new task, Video-and-Language Inference, for joint multimodal
understanding of video and text. Given a video clip with aligned subtitles as premise, paired …

Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering

J Li, L Niu, L Zhang - … of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com
Video understanding has achieved great success in representation learning, such as video
caption, video object grounding, and video descriptive question-answer. However, current …

Valley: Video assistant with large language model enhanced ability

R Luo, Z Zhao, M Yang, J Dong, M Qiu, P Lu… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, several multi-modal models have been developed for joint image and language
understanding, which have demonstrated impressive chat abilities by utilizing advanced …

From recognition to cognition: Visual commonsense reasoning

R Zellers, Y Bisk, A Farhadi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Visual understanding goes well beyond object recognition. With one glance at an image, we
can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …

Revisiting the "video" in video-language understanding

S Buch, C Eyzaguirre, A Gaidon, J Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com
What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …

KM-BART: Knowledge enhanced multimodal BART for visual commonsense generation

Y Xing, Z Shi, Z Meng, G Lakemeyer, Y Ma… - arXiv preprint arXiv …, 2021 - arxiv.org
We present Knowledge Enhanced Multimodal BART (KM-BART), which is a Transformer-
based sequence-to-sequence model capable of reasoning about commonsense knowledge …

LAVENDER: Unifying video-language understanding as masked language modeling

L Li, Z Gan, K Lin, CC Lin, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Unified vision-language frameworks have greatly advanced in recent years, most of which
adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence …