VideoPrism: A Foundational Visual Encoder for Video Understanding

L Zhao, NB Gundavarapu, L Yuan, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video
understanding tasks with a single frozen model. We pretrain VideoPrism on a …

Towards Generalist Robot Learning from Internet Video: A Survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …

VideoPhy: Evaluating Physical Commonsense for Video Generation

H Bansal, Z Lin, T Xie, Z Zong, M Yarom… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in internet-scale video data pretraining have led to the development of text-
to-video generative models that can create high-quality videos across a broad range of …

EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?

B Xu, Z Wang, Y Du, S Zheng, Z Song, Q Jin - arXiv preprint arXiv …, 2024 - arxiv.org
Egocentric video-language pretraining is a crucial paradigm to advance the learning of
egocentric hand-object interactions (EgoHOI). Despite the great success on existing …

VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

D Saravanan, D Singh, V Gupta, Z Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
Compositionality is a fundamental aspect of vision-language understanding and is
especially required for videos since they contain multiple entities (e.g., persons, actions, and …