Lavender: Unifying video-language understanding as masked language modeling

L Li, Z Gan, K Lin, CC Lin, Z Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Unified vision-language frameworks have greatly advanced in recent years, most of which
adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence …
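The title above frames video-language understanding as masked language modeling. As a reminder of what that objective looks like in isolation, here is a minimal Python sketch of MLM-style token masking; the token ids, MASK_ID value, and mask_tokens helper are illustrative assumptions, not code from the paper.

```python
import random

MASK_ID = 103          # hypothetical [MASK] token id
IGNORE_INDEX = -100    # positions excluded from the loss

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Randomly replace tokens with [MASK] and return (inputs, labels).

    Labels keep the original id only at masked positions, so a standard
    cross-entropy loss with ignore_index=-100 trains the model to recover
    the masked words, i.e. the masked-language-modeling objective named
    in the title above.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tok)           # predict the original token here
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)  # no loss at unmasked positions
    return inputs, labels

if __name__ == "__main__":
    ids = [7592, 2088, 2003, 1037, 2978]   # toy token ids
    masked, labels = mask_tokens(ids, mask_prob=0.3, seed=0)
    print(masked, labels)
```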

Distilling vision-language models on millions of videos

Y Zhao, L Zhao, X Zhou, J Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The recent advance in vision-language models is largely attributed to the abundance of
image-text data. We aim to replicate this success for video-language models but there …

An empirical study of end-to-end video-language transformers with masked visual modeling

TJ Fu, L Li, Z Gan, K Lin, WY Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Masked visual modeling (MVM) has been recently proven effective for visual pre-training.
While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have …
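Since the snippet names masked frame modeling as the reconstructive objective, a compact sketch of that idea follows: hide a random subset of per-frame features and penalize the model only where frames were masked. The function names, mask ratio, and use of PyTorch are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def masked_frame_targets(frames, mask_ratio=0.5, generator=None):
    """Zero out a random subset of frame features and return the mask.

    frames: (T, D) tensor of per-frame features for one clip.
    Returns (masked_frames, mask) where mask[t] is True for frames the
    model must reconstruct.
    """
    T = frames.shape[0]
    num_masked = max(1, int(T * mask_ratio))
    perm = torch.randperm(T, generator=generator)
    mask = torch.zeros(T, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    masked = frames.clone()
    masked[mask] = 0.0                      # hide the masked frames
    return masked, mask

def reconstruction_loss(predicted, original, mask):
    """L2 reconstruction loss computed only on the masked frames."""
    return F.mse_loss(predicted[mask], original[mask])

if __name__ == "__main__":
    feats = torch.randn(8, 16)              # 8 frames, 16-dim features
    masked, mask = masked_frame_targets(feats, mask_ratio=0.25)
    pred = torch.randn_like(feats)          # stand-in for model output
    print(reconstruction_loss(pred, feats, mask).item())
```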

Llama-vid: An image is worth 2 tokens in large language models

Y Li, C Wang, J Jia - arXiv preprint arXiv:2311.17043, 2023 - arxiv.org
In this work, we present a novel method to tackle the token generation challenge in Vision
Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current …

Less is more: Clipbert for video-and-language learning via sparse sampling

J Lei, L Li, L Zhou, Z Gan, TL Berg… - Proceedings of the …, 2021 - openaccess.thecvf.com
The canonical approach to video-and-language learning (e.g., video question answering)
dictates a neural model to learn from offline-extracted dense video features from vision …
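ClipBERT's title points to sparse sampling as the alternative to offline-extracted dense features. A tiny sampler in that spirit is sketched below; the segment-middle heuristic and function name are assumptions, not the paper's exact sampling scheme.

```python
def sparse_frame_indices(num_frames, num_samples=4):
    """Pick a small, evenly spaced set of frame indices from a clip.

    Sparse sampling in this spirit lets a model see only a handful of
    frames per clip at training time instead of dense offline features.
    """
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the middle frame of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

if __name__ == "__main__":
    print(sparse_frame_indices(300, num_samples=4))   # -> [37, 112, 187, 262]
```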

Pllava: Parameter-free llava extension from images to videos for video dense captioning

L Xu, Y Zhao, D Zhou, Z Lin, SK Ng, J Feng - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …

Video-llama: An instruction-tuned audio-visual language model for video understanding

H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models
(LLMs) with the capability of understanding both visual and auditory content in the video …

Revisiting the" video" in video-language understanding

S Buch, C Eyzaguirre, A Gaidon, J Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com
What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …

Video-llava: Learning united visual representation by alignment before projection

B Lin, B Zhu, Y Ye, M Ning, P Jin, L Yuan - arXiv preprint arXiv:2311.10122, 2023 - arxiv.org
The Large Vision-Language Model (LVLM) has enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …

Egoschema: A diagnostic benchmark for very long-form video language understanding

K Mangalam, R Akshulakov… - Advances in Neural …, 2024 - proceedings.neurips.cc
We introduce EgoSchema, a very long-form video question-answering dataset and
benchmark to evaluate long video understanding capabilities of modern vision and …