Recent advances in vision-language models are largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there …
Masked visual modeling (MVM) has recently been proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have …
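Since this snippet centers on masked visual modeling as a reconstructive pre-training objective, a minimal sketch of one generic form of it on video patches follows. Everything here is an illustrative assumption (the `MaskedVideoModeling` class, the SimMIM-style learnable mask token, the dimensions), not the specific method of any paper listed in these results.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedVideoModeling(nn.Module):
    """Hedged sketch of a generic masked-visual-modeling objective:
    hide a random subset of video patches behind a learnable mask token,
    then reconstruct the original patch values at the masked positions."""

    def __init__(self, patch_dim=768, dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        # Learnable placeholder inserted at masked positions (an assumption
        # of this sketch; papers differ in how they drop or replace patches).
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, patch_dim)  # predicts raw patch values

    def forward(self, patches, mask_ratio=0.6):
        # patches: (B, N, patch_dim) flattened spatio-temporal video patches
        B, N, _ = patches.shape
        tokens = self.embed(patches)
        # Randomly choose which patches to hide from the encoder.
        mask = torch.rand(B, N, device=patches.device) < mask_ratio  # (B, N)
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, -1), tokens)
        pred = self.head(self.encoder(tokens))  # (B, N, patch_dim)
        # Reconstruction loss is computed only on the masked positions,
        # forcing the encoder to infer them from surrounding context.
        return F.mse_loss(pred[mask], patches[mask])

# Illustrative usage: 8 frames x 16 patches of 768-dim values per clip.
mvm = MaskedVideoModeling()
loss = mvm(torch.randn(4, 8 * 16, 768))
loss.backward()
```

Restricting the loss to masked positions is the design choice that makes the objective non-trivial: visible patches alone must carry enough spatial and temporal context to predict the hidden ones.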
Y Li, C Wang, J Jia - arXiv preprint arXiv:2311.17043, 2023 - arxiv.org
In this work, we present LLaMA-VID, a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding. Current …
The canonical approach to video-and-language learning (e.g., video question answering) dictates that a neural model learn from offline-extracted dense video features from vision …
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet the pre-training process for video-related tasks demands …
H Zhang, X Li, L Bing - arXiv preprint arXiv:2306.02858, 2023 - arxiv.org
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video …
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we …
Large Vision-Language Models (LVLMs) have enhanced performance on various downstream tasks in visual-language understanding. Most existing approaches encode …
We introduce EgoSchema, a very long-form video question-answering dataset and benchmark for evaluating the long-video understanding capabilities of modern vision and …