Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

FineMatch: Aspect-based fine-grained image and text mismatch detection and correction

H Hua, J Shi, K Kafle, S Jenni, D Zhang… - … on Computer Vision, 2025 - Springer
Recent progress in large-scale pre-training has led to the development of advanced vision-
language models (VLMs) with remarkable proficiency in comprehending and generating …

Do language models understand time?

X Ding, L Wang - arXiv preprint arXiv:2412.13845, 2024 - arxiv.org
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …

EAGLE: Egocentric AGgregated Language-video Engine

J Bi, Y Tang, L Song, A Vosoughi, N Nguyen… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of egocentric video analysis brings new insights into understanding
human activities and intentions from a first-person perspective. Despite this progress, the …

CaRDiff: Video salient object ranking chain of thought reasoning for saliency prediction with diffusion

Y Tang, G Zhan, L Yang, Y Liao, C Xu - arXiv preprint arXiv:2408.12009, 2024 - arxiv.org
Video saliency prediction aims to identify the regions in a video that attract human attention
and gaze, driven by bottom-up features from the video and top-down processes like memory …

TRACE: Temporal grounding video LLM via causal event modeling

Y Guo, J Liu, M Li, X Tang, Q Liu, X Chen - arXiv preprint arXiv …, 2024 - arxiv.org
Video Temporal Grounding (VTG) is a crucial capability for video understanding models and
plays a vital role in downstream tasks such as video browsing and editing. To effectively …

EGGesture: Entropy-Guided Vector Quantized Variational AutoEncoder for Co-Speech Gesture Generation

Y Xiao, K Shu, H Zhang, B Yin, WS Cheang… - Proceedings of the …, 2024 - dl.acm.org
Co-Speech gesture generation encounters challenges with imbalanced, long-tailed gesture
distributions. While recent methods typically address this by employing Vector Quantized …

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

H Hua, Y Tang, Z Zeng, L Cao, Z Yang, H He… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal
understanding, enabling more sophisticated and accurate integration of visual and textual …

Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

L He, Y Song, H Huang, D Aliaga, X Zhou - arXiv preprint arXiv …, 2024 - arxiv.org
Text-to-video generation has been dominated by end-to-end diffusion-based or
autoregressive models. On one hand, those novel models provide plausible versatility, but …

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

H Qiu, M Gao, L Qian, K Pan, Q Yu, J Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Large Language Models (Video-LLMs) have recently shown strong performance in
basic video understanding tasks, such as captioning and coarse-grained question …