VideoPrism: A foundational visual encoder for video understanding

L Zhao, NB Gundavarapu, L Yuan, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video
understanding tasks with a single frozen model. We pretrain VideoPrism on a …

Video Question Answering: A survey of the state-of-the-art

PJ Jeshmol, BC Kovoor - Journal of Visual Communication and Image …, 2024 - Elsevier
Video Question Answering (VideoQA) emerges as a prominent trend in the domain
of Artificial Intelligence, Computer Vision, and Natural Language Processing. It involves …

Real3D: Scaling Up Large Reconstruction Models with Real-World Images

H Jiang, Q Huang, G Pavlakos - arXiv preprint arXiv:2406.08479, 2024 - arxiv.org
The default strategy for training single-view Large Reconstruction Models (LRMs) follows the
fully supervised route using large-scale datasets of synthetic 3D assets or multi-view …

MotionLLM: Understanding Human Behaviors from Human Motions and Videos

LH Chen, S Lu, A Zeng, H Zhang, B Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
This study delves into the realm of multi-modality (i.e., video and motion modalities) human
behavior understanding by leveraging the powerful capabilities of Large Language Models …

Apollo: An Exploration of Video Understanding in Large Multimodal Models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …

MoS2: Mixture of Scale and Shift Experts for Text-Only Video Captioning

H Jia, Y Xu, L Zhu, G Chen, Y Wang… - Proceedings of the 32nd …, 2024 - dl.acm.org
Video captioning is a challenging task and typically requires paired video-text data for
training. However, manually annotating coherent textual descriptions for videos is laborious …

MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline

D Han, E Park, G Lee, A Lee, N Kwak - arXiv preprint arXiv:2407.12508, 2024 - arxiv.org
The rapid expansion of multimedia content has made accurately retrieving relevant videos
from large collections increasingly challenging. Recent advancements in text-video retrieval …

Improving semantic video retrieval models by training with a relevance-aware online mining strategy

A Falcon, G Serra, O Lanz - Computer Vision and Image Understanding, 2024 - Elsevier
To retrieve a video via a multimedia search engine, a textual query is usually created by the
user and then used to perform the search. Recent state-of-the-art cross-modal retrieval …

Foundation Models for Video Understanding: A Survey

N Madan, A Møgelmose, R Modi, YS Rawat… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various
video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs …

Video Foundation Models for Animal Behavior Analysis

JJ Sun, H Zhou, L Zhao, L Yuan, B Seybold, D Hendon… - bioRxiv, 2024 - biorxiv.org
Computational approaches leveraging computer vision and machine learning have
transformed the quantification of animal behavior from video. However, existing methods …