VideoPrism: A Foundational Visual Encoder for Video Understanding

L Zhao, NB Gundavarapu, L Yuan, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video
understanding tasks with a single frozen model. We pretrain VideoPrism on a …

Towards Generalist Robot Learning from Internet Video: A Survey

R McCarthy, DCH Tan, D Schmidt, F Acero… - arXiv preprint arXiv …, 2024 - arxiv.org
This survey presents an overview of methods for learning from video (LfV) in the context of
reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large …

VideoPhy: Evaluating Physical Commonsense for Video Generation

H Bansal, Z Lin, T Xie, Z Zong, M Yarom… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in internet-scale video data pretraining have led to the development of text-
to-video generative models that can create high-quality videos across a broad range of …

EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?

B Xu, Z Wang, Y Du, S Zheng, Z Song, Q Jin - arXiv preprint arXiv …, 2024 - arxiv.org
Egocentric video-language pretraining is a crucial paradigm to advance the learning of
egocentric hand-object interactions (EgoHOI). Despite the great success on existing …

VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

D Saravanan, D Singh, V Gupta, Z Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
Compositionality is a fundamental aspect of vision-language understanding and is
especially required for videos since they contain multiple entities (e.g., persons, actions, and …