Recent advances in vision-language models are largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there …
Masked visual modeling (MVM) has recently been proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have …
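Since this snippet centers on masked visual modeling as a reconstructive pre-training objective, a minimal sketch of one generic form of it on video patches follows. Everything here is an illustrative assumption (the `MaskedVideoModeling` class, the SimMIM-style learnable mask token, the dimensions), not the specific method of any paper listed in these results.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedVideoModeling(nn.Module):
    """Hedged sketch of a generic masked-visual-modeling objective:
    hide a random subset of video patches behind a learnable mask token,
    then reconstruct the original patch values at the masked positions."""

    def __init__(self, patch_dim=768, dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        # Learnable placeholder inserted at masked positions (an assumption
        # of this sketch; papers differ in how they drop or replace patches).
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, patch_dim)  # predicts raw patch values

    def forward(self, patches, mask_ratio=0.6):
        # patches: (B, N, patch_dim) flattened spatio-temporal video patches
        B, N, _ = patches.shape
        tokens = self.embed(patches)
        # Randomly choose which patches to hide from the encoder.
        mask = torch.rand(B, N, device=patches.device) < mask_ratio  # (B, N)
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, -1), tokens)
        pred = self.head(self.encoder(tokens))  # (B, N, patch_dim)
        # Reconstruction loss is computed only on the masked positions,
        # forcing the encoder to infer them from surrounding context.
        return F.mse_loss(pred[mask], patches[mask])

# Illustrative usage: 8 frames x 16 patches of 768-dim values per clip.
mvm = MaskedVideoModeling()
loss = mvm(torch.randn(4, 8 * 16, 768))
loss.backward()
```

Restricting the loss to masked positions is the design choice that makes the objective non-trivial: visible patches alone must carry enough spatial and temporal context to predict the hidden ones.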
Y Li, C Wang, J Jia - arXiv preprint arXiv:2311.17043, 2023 - arxiv.org
In this work, we present LLaMA-VID, a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding. Current …
The canonical approach to video-and-language learning (e.g., video question answering) dictates that a neural model learn from offline-extracted dense video features from vision …
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet the pre-training process for video-related tasks demands …
H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video …
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we …
Large Vision-Language Models (LVLMs) have enhanced performance on various downstream tasks in visual-language understanding. Most existing approaches encode …
We introduce EgoSchema, a very long-form video question-answering dataset and benchmark for evaluating the long-video understanding capabilities of modern vision and …