We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a …
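The frozen-model setting in this snippet is worth unpacking: the pretrained backbone is never fine-tuned, and each downstream task trains only a small head on top of its fixed features. A minimal PyTorch sketch of that workflow follows; the `ToyVideoEncoder`, its dimensions, and the task size are illustrative stand-ins and not VideoPrism's actual architecture or API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained video backbone such as VideoPrism.
# Real checkpoints expose their own loading APIs; this toy encoder only
# illustrates the frozen-feature workflow.
class ToyVideoEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv3d(3, embed_dim, kernel_size=(2, 16, 16),
                              stride=(2, 16, 16))  # patchify T x H x W

    def forward(self, video):                      # (B, 3, T, H, W)
        tokens = self.proj(video).flatten(2)       # (B, D, N)
        return tokens.mean(dim=2)                  # pooled clip embedding

encoder = ToyVideoEncoder()
encoder.requires_grad_(False)   # freeze: encoder weights are never updated
encoder.eval()

num_classes = 10                # assumed downstream task size
head = nn.Linear(256, num_classes)           # the only trainable part
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

video = torch.randn(4, 3, 8, 224, 224)       # dummy batch of clips
labels = torch.randint(0, num_classes, (4,))

with torch.no_grad():                        # features come from the frozen model
    feats = encoder(video)
loss = nn.functional.cross_entropy(head(feats), labels)
loss.backward()
optimizer.step()
```

Because only the linear head carries gradients, swapping in a new task means swapping the head, which is what makes a single frozen encoder serviceable across diverse benchmarks.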
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a …
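Feature prediction as a stand-alone objective means the model regresses latent representations of hidden content rather than reconstructing pixels. Below is a hedged PyTorch sketch in that spirit: the module sizes, the 75% masking ratio, the exact L1 loss, and the 0.996 EMA momentum are assumptions for illustration, not V-JEPA's published configuration.

```python
import copy
import torch
import torch.nn as nn

# Sketch of feature prediction in latent space: a predictor regresses the
# target encoder's features for masked tokens from the context encoder's
# features for visible ones.
dim, num_tokens, batch = 128, 196, 4

context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2)
target_encoder = copy.deepcopy(context_encoder)   # EMA copy, never backpropped
predictor = nn.Linear(dim, dim)                   # stand-in for a small transformer

tokens = torch.randn(batch, num_tokens, dim)      # pretend patch embeddings
mask = torch.rand(batch, num_tokens) < 0.75       # ~75% of tokens hidden (assumed)

# Zeroing masked tokens is a simplification; real methods drop them entirely.
ctx = context_encoder(tokens * (~mask).unsqueeze(-1))
with torch.no_grad():
    tgt = target_encoder(tokens)                  # targets from the full clip

# Regress target features at masked positions only (an L1 objective is
# assumed here).
loss = nn.functional.l1_loss(predictor(ctx)[mask], tgt[mask])
loss.backward()

# After each optimizer step, the target encoder tracks the context encoder
# via an exponential moving average instead of gradients.
with torch.no_grad():
    for p_t, p_c in zip(target_encoder.parameters(),
                        context_encoder.parameters()):
        p_t.lerp_(p_c, 1.0 - 0.996)               # momentum 0.996 (assumed)
```

The key design choice is that no pixel decoder exists anywhere in the loop: supervision comes entirely from the slowly moving target encoder's representations.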
This paper shows that the masked-modelling principle driving the success of large foundational language models can be effectively applied to video by making predictions in …
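To make the masked-modelling analogy concrete, the sketch below hides most spacetime patches of a clip and trains a model to predict only the hidden ones, just as masked language models are supervised only on hidden tokens. The tube-masking pattern, the 90% ratio, and pixel-space targets are common choices in this family (e.g., VideoMAE) but are assumptions here, since the snippet above does not specify the paper's prediction target.

```python
import torch
import torch.nn as nn

# Sketch of masked video modelling: hide most spacetime patches, then train
# a model to predict the hidden content. Whether the target is raw pixels or
# learned latents varies by method; pixels are used here purely for brevity.
B, T, N, D = 2, 8, 196, 768            # clips, frames, patches/frame, patch dim

patches = torch.randn(B, T, N, D)      # flattened spacetime patch pixels

# "Tube" masking: pick one spatial mask and repeat it across all frames so
# the model cannot cheat by copying a nearby unmasked frame.
spatial_mask = torch.rand(B, 1, N) < 0.9          # ~90% masking ratio (assumed)
mask = spatial_mask.expand(B, T, N)               # same holes in every frame

encoder = nn.Sequential(nn.Linear(D, 256), nn.GELU(), nn.Linear(256, 256))
decoder = nn.Linear(256, D)            # reconstructs patch pixels

visible = patches * (~mask).unsqueeze(-1).float() # keep only visible patches
pred = decoder(encoder(visible))                  # predict every patch slot

# The loss covers masked positions only, mirroring masked language modelling
# where only hidden tokens contribute to the objective.
loss = nn.functional.mse_loss(pred[mask], patches[mask])
loss.backward()
```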
X. Li, Z. Huang, J. Wang, K. Li, L. Wang. arXiv preprint arXiv:2407.06491, 2024.
With the growth of high-quality data and advances in visual pre-training paradigms, Video Foundation Models (VFMs) have made significant progress recently, demonstrating …
Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital modalities, since they allow us to easily communicate our thoughts and …
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs …
The proliferation of video collections and the increased capabilities of machine learning models have led to a growing desire for video analytics, the process of extracting insights …
As video understanding (VU) promises unprecedented capabilities in the era of video data explosion, its efficient computation plays a critical role in the practical deployment of the algorithmic …