VideoPrism: A foundational visual encoder for video understanding

T Hummel, S Karthik, MI Georgescu, Z Akata - arXiv preprint arXiv …, 2024 - arxiv.org

In Composed Video Retrieval, a video and a textual description which modifies the video
content are provided as inputs to the model. The aim is to retrieve the relevant video with the …

VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model

X Li, Z Huang, J Wang, K Li, L Wang - arXiv preprint arXiv:2407.06491, 2024 - arxiv.org

With the growth of high-quality data and advancement in visual pre-training paradigms,
Video Foundation Models (VFMs) have made significant progress recently, demonstrating …

Localizing Events in Videos with Multimodal Queries

G Zhang, MLA Fok, Y Xia, Y Tang, D Cremers… - arXiv preprint arXiv …, 2024 - arxiv.org

Video understanding is a pivotal task in the digital era, yet the dynamic and multievent
nature of videos makes them labor-intensive and computationally demanding to process …

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

Y Guo, J Liu, M Li, X Tang, X Chen, B Zhao - arXiv preprint arXiv …, 2024 - arxiv.org

Video Temporal Grounding (VTG) focuses on accurately identifying event timestamps within
a particular video based on a linguistic query, playing a vital role in downstream tasks such …

Foundation Models for Video Understanding: A Survey

N Madan, A Møgelmose, R Modi, YS Rawat… - arXiv preprint arXiv …, 2024 - arxiv.org

Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various
video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs …

Cited by 3

Tarsier: Recipes for Training and Evaluating Large Video Description Models

J Wang, L Yuan, Y Zhang - arXiv preprint arXiv:2407.00634, 2024 - arxiv.org

Generating fine-grained video descriptions is a fundamental challenge in video
understanding. In this work, we introduce Tarsier, a family of large-scale video-language …

Video Foundation Models for Animal Behavior Analysis

JJ Sun, H Zhou, L Zhao, L Yuan, B Seybold, D Hendon… - bioRxiv, 2024 - biorxiv.org

Computational approaches leveraging computer vision and machine learning have
transformed the quantification of animal behavior from video. However, existing methods …
