We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a …
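The frozen-model setting in this snippet is worth unpacking: the pretrained backbone is never fine-tuned, and each downstream task trains only a small head on top of its fixed features. A minimal PyTorch sketch of that workflow follows; the `ToyVideoEncoder`, its dimensions, and the task size are illustrative stand-ins and not VideoPrism's actual architecture or API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained video backbone such as VideoPrism.
# Real checkpoints expose their own loading APIs; this toy encoder only
# illustrates the frozen-feature workflow.
class ToyVideoEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv3d(3, embed_dim, kernel_size=(2, 16, 16),
                              stride=(2, 16, 16))  # patchify T x H x W

    def forward(self, video):                      # (B, 3, T, H, W)
        tokens = self.proj(video).flatten(2)       # (B, D, N)
        return tokens.mean(dim=2)                  # pooled clip embedding

encoder = ToyVideoEncoder()
encoder.requires_grad_(False)   # freeze: encoder weights are never updated
encoder.eval()

num_classes = 10                # assumed downstream task size
head = nn.Linear(256, num_classes)           # the only trainable part
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

video = torch.randn(4, 3, 8, 224, 224)       # dummy batch of clips
labels = torch.randint(0, num_classes, (4,))

with torch.no_grad():                        # features come from the frozen model
    feats = encoder(video)
loss = nn.functional.cross_entropy(head(feats), labels)
loss.backward()
optimizer.step()
```

Because only the linear head carries gradients, swapping in a new task means swapping the head, which is what makes a single frozen encoder serviceable across diverse benchmarks.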
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a …
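Feature prediction as a stand-alone objective means the model regresses latent representations of hidden content rather than reconstructing pixels. Below is a hedged PyTorch sketch in that spirit: the module sizes, the 75% masking ratio, the exact L1 loss, and the 0.996 EMA momentum are assumptions for illustration, not V-JEPA's published configuration.

```python
import copy
import torch
import torch.nn as nn

# Sketch of feature prediction in latent space: a predictor regresses the
# target encoder's features for masked tokens from the context encoder's
# features for visible ones.
dim, num_tokens, batch = 128, 196, 4

context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2)
target_encoder = copy.deepcopy(context_encoder)   # EMA copy, never backpropped
predictor = nn.Linear(dim, dim)                   # stand-in for a small transformer

tokens = torch.randn(batch, num_tokens, dim)      # pretend patch embeddings
mask = torch.rand(batch, num_tokens) < 0.75       # ~75% of tokens hidden (assumed)

# Zeroing masked tokens is a simplification; real methods drop them entirely.
ctx = context_encoder(tokens * (~mask).unsqueeze(-1))
with torch.no_grad():
    tgt = target_encoder(tokens)                  # targets from the full clip

# Regress target features at masked positions only (an L1 objective is
# assumed here).
loss = nn.functional.l1_loss(predictor(ctx)[mask], tgt[mask])
loss.backward()

# After each optimizer step, the target encoder tracks the context encoder
# via an exponential moving average instead of gradients.
with torch.no_grad():
    for p_t, p_c in zip(target_encoder.parameters(),
                        context_encoder.parameters()):
        p_t.lerp_(p_c, 1.0 - 0.996)               # momentum 0.996 (assumed)
```

The key design choice is that no pixel decoder exists anywhere in the loop: supervision comes entirely from the slowly moving target encoder's representations.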
This paper shows that the masked-modelling principle driving the success of large foundational language models can be effectively applied to video by making predictions in …
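To make the masked-modelling analogy concrete, the sketch below hides most spacetime patches of a clip and trains a model to predict only the hidden ones, just as masked language models are supervised only on hidden tokens. The tube-masking pattern, the 90% ratio, and pixel-space targets are common choices in this family (e.g., VideoMAE) but are assumptions here, since the snippet above does not specify the paper's prediction target.

```python
import torch
import torch.nn as nn

# Sketch of masked video modelling: hide most spacetime patches, then train
# a model to predict the hidden content. Whether the target is raw pixels or
# learned latents varies by method; pixels are used here purely for brevity.
B, T, N, D = 2, 8, 196, 768            # clips, frames, patches/frame, patch dim

patches = torch.randn(B, T, N, D)      # flattened spacetime patch pixels

# "Tube" masking: pick one spatial mask and repeat it across all frames so
# the model cannot cheat by copying a nearby unmasked frame.
spatial_mask = torch.rand(B, 1, N) < 0.9          # ~90% masking ratio (assumed)
mask = spatial_mask.expand(B, T, N)               # same holes in every frame

encoder = nn.Sequential(nn.Linear(D, 256), nn.GELU(), nn.Linear(256, 256))
decoder = nn.Linear(256, D)            # reconstructs patch pixels

visible = patches * (~mask).unsqueeze(-1).float() # keep only visible patches
pred = decoder(encoder(visible))                  # predict every patch slot

# The loss covers masked positions only, mirroring masked language modelling
# where only hidden tokens contribute to the objective.
loss = nn.functional.mse_loss(pred[mask], patches[mask])
loss.backward()
```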
X. Li, Z. Huang, J. Wang, K. Li, L. Wang. arXiv preprint arXiv:2407.06491, 2024.
With the growth of high-quality data and advances in visual pre-training paradigms, Video Foundation Models (VFMs) have made significant progress recently, demonstrating …
Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital modalities, since they allow us to easily communicate our thoughts and …
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs …
The proliferation of video collections and the increased capabilities of machine learning models have led to a growing desire for video analytics, the process of extracting insights …
As video understanding (VU) promises unprecedented capabilities in the era of video data explosion, its efficient computation plays a critical role in the practical deployment of the algorithmic …