We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired …
J Huang, Y Li, J Feng, X Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Building a universal video-language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge to the machine …
P Bagad, M Tapaswi… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Modelling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful …
H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video …
With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models …
H Xu, G Ghosh, PY Huang, P Arora… - arXiv preprint arXiv …, 2021 - arxiv.org
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task …
Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing remarkable ability for …
Unified vision-language frameworks have greatly advanced in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence …