Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Temporal sentence grounding in videos (TSGV), aka, natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos

X Fang, D Liu, P Zhou, G Nan - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target
moment semantically according to a sentence query. Although previous respectable works …

Hypotheses tree building for one-shot temporal sentence localization

D Liu, X Fang, P Zhou, X Di, W Lu… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Given an untrimmed video, temporal sentence localization (TSL) aims to localize a specific
segment according to a given sentence query. Though respectable works have made …

Faster video moment retrieval with point-level supervision

X Jiang, Z Zhou, X Xu, Y Yang, G Wang… - Proceedings of the 31st …, 2023 - dl.acm.org
Video Moment Retrieval (VMR) aims at retrieving the most relevant events from an
untrimmed video with natural language queries. Existing VMR methods suffer from two …

Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding

C Tan, J Lai, WS Zheng, JF Hu - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Video Paragraph Grounding (VPG) is an emerging task in video-language
understanding which aims at localizing multiple sentences with semantic relations and …

Learning temporal sentence grounding from narrated EgoVideos

K Flanagan, D Damen, M Wray - arXiv preprint arXiv:2310.17395, 2023 - arxiv.org
The onset of long-form egocentric datasets such as Ego4D and EPIC-Kitchens presents a
new challenge for the task of Temporal Sentence Grounding (TSG). Compared to traditional …

Learning point-language hierarchical alignment for 3D visual grounding

J Chen, W Luo, R Song, X Wei, L Ma… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper presents a novel hierarchical alignment model (HAM) that learns multi-
granularity visual and linguistic representations in an end-to-end manner. We extract key …

Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding

S Kim, J Cho, J Yu, YJ Yoo, JY Choi - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In the weakly supervised temporal video grounding study, previous methods use
predetermined single Gaussian proposals which lack the ability to express diverse events …

EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model

G Li, X Ding, D Cheng, J Li, N Wang, X Gao - arXiv preprint arXiv …, 2023 - arxiv.org
Early weakly supervised video grounding (WSVG) methods often struggle with incomplete
boundary detection due to the absence of temporal boundary annotations. To bridge the gap …

Zero-Shot Video Moment Retrieval Using BLIP-Based Models

JI Wattasseril, S Shekhar, J Döllner, M Trapp - International Symposium on …, 2023 - Springer
Video Moment Retrieval (VMR) is a challenging task at the intersection of vision
and language, with the goal to retrieve relevant moments from videos corresponding to …