相关文章- 学术资源搜索

Merlot reserve: Neural script knowledge through vision and language and sound

R Zellers, J Lu, X Lu, Y Yu, Y Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com

As humans, we navigate a multimodal world, building a holistic understanding from all our
senses. We introduce MERLOT Reserve, a model that represents videos jointly over time …

被引用次数：208 相关文章所有 11 个版本

[PDF] neurips.cc

Merlot: Multimodal neural script knowledge models

R Zellers, X Lu, J Hessel, Y Yu… - Advances in …, 2021 - proceedings.neurips.cc

As humans, we understand events in the visual world contextually, performing multimodal
reasoning across time to make inferences about the past, present, and future. We introduce …

被引用次数：348 相关文章所有 8 个版本

[PDF] thecvf.com

Violin: A large-scale dataset for video-and-language inference

J Liu, W Chen, Y Cheng, Z Gan, L Yu… - Proceedings of the …, 2020 - openaccess.thecvf.com

We introduce a new task, Video-and-Language Inference, for joint multimodal
understanding of video and text. Given a video clip with aligned subtitles as premise, paired …

被引用次数：70 相关文章所有 8 个版本

[PDF] thecvf.com

Chat-univi: Unified visual representation empowers large language models with image and video understanding

P Jin, R Takanobu, W Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large language models have demonstrated impressive universal capabilities across a wide
range of open-ended tasks and have extended their utility to encompass multimodal …

被引用次数：27 相关文章所有 3 个版本

[PDF] thecvf.com

From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering

J Li, L Niu, L Zhang - … of the IEEE/CVF conference on …, 2022 - openaccess.thecvf.com

Video understanding has achieved great success in representation learning, such as video
caption, video object grounding, and video descriptive question-answer. However, current …

被引用次数：39 相关文章所有 7 个版本

[PDF] arxiv.org

Valley: Video assistant with large language model enhanced ability

R Luo, Z Zhao, M Yang, J Dong, M Qiu, P Lu… - arXiv preprint arXiv …, 2023 - arxiv.org

Recently, several multi-modal models have been developed for joint image and language
understanding, which have demonstrated impressive chat abilities by utilizing advanced …

被引用次数：83 相关文章所有 3 个版本

[PDF] thecvf.com

From recognition to cognition: Visual commonsense reasoning

R Zellers, Y Bisk, A Farhadi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com

Visual understanding goes well beyond object recognition. With one glance at an image, we
can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …

被引用次数：873 相关文章所有 8 个版本

[PDF] thecvf.com

Revisiting the" video" in video-language understanding

S Buch, C Eyzaguirre, A Gaidon, J Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com

What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …

被引用次数：120 相关文章所有 8 个版本

[PDF] arxiv.org

Km-bart: Knowledge enhanced multimodal bart for visual commonsense generation

Y Xing, Z Shi, Z Meng, G Lakemeyer, Y Ma… - arXiv preprint arXiv …, 2021 - arxiv.org

We present Knowledge Enhanced Multimodal BART (KM-BART), which is a Transformer-
based sequence-to-sequence model capable of reasoning about commonsense knowledge …

被引用次数：43 相关文章所有 14 个版本

[PDF] thecvf.com

Lavender: Unifying video-language understanding as masked language modeling

L Li, Z Gan, K Lin, CC Lin, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Unified vision-language frameworks have greatly advanced in recent years, most of which
adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence …

被引用次数：67 相关文章所有 5 个版本

高级搜索

QQ 群

Merlot reserve: Neural script knowledge through vision and language and sound

Merlot: Multimodal neural script knowledge models

Violin: A large-scale dataset for video-and-language inference

Chat-univi: Unified visual representation empowers large language models with image and video understanding

From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering

Valley: Video assistant with large language model enhanced ability

From recognition to cognition: Visual commonsense reasoning

Revisiting the" video" in video-language understanding

Km-bart: Knowledge enhanced multimodal bart for visual commonsense generation

Lavender: Unifying video-language understanding as masked language modeling

引用