E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Y Liu, Z Ma, Z Qi, Y Wu, Y Shan, CW Chen - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their
great potential in general-purpose video understanding. To verify the significance of these …

Saliency-Guided DETR for Moment Retrieval and Highlight Detection

A Gordeev, V Dokholyan, I Tolstykh… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing approaches for video moment retrieval and highlight detection are not able to align
text and video features efficiently, resulting in unsatisfying performance and limited …

FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding

Z Cao, B Zhang, H Du, X Yu, X Li, S Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Text-guided Video Temporal Grounding (VTG) aims to localize relevant segments in
untrimmed videos based on textual descriptions, encompassing two subtasks: Moment …

VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

D Paul, MR Parvez, N Mohammed… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis.
Recent joint prediction transformer models often overlook their cross-task dynamics and …

LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection

P Zhao, Z He, F Zhang, S Lin, F Zhou - arXiv preprint arXiv:2501.10787, 2025 - arxiv.org
Video Moment Retrieval and Highlight Detection aim to find corresponding content in the
video based on a text query. Existing models usually first use contrastive learning methods …

Length-Aware DETR for Robust Moment Retrieval

S Park, J Choi, K Baek, H Shim - arXiv preprint arXiv:2412.20816, 2024 - arxiv.org
Video Moment Retrieval (MR) aims to localize moments within a video based on a given
natural language query. Given the prevalent use of platforms like YouTube for information …

D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching

J Liu, M Wang, Y Ma, B Wang, A Chen, Q Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Videos showcasing specific products are increasingly important for E-commerce. Key
moments naturally exist as the first appearance of a specific product, presentation of its …

E³: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset

W Lin, Y Feng, WK Han, T Jin, Z Zhao, F Wu… - The Thirty-eight … - openreview.net
Understanding human emotions is fundamental to enhancing human-computer interaction,
especially for embodied agents that mimic human behavior. Traditional emotion analysis …