Rethinking weakly-supervised video temporal grounding from a game perspective

X Fang, Z Xiong, W Fang, X Qu, C Chen, J Dong… - … on Computer Vision, 2025 - Springer
This paper addresses the challenging task of weakly-supervised video temporal grounding.
Existing approaches are generally based on the moment proposal selection framework that …

Rethinking Video Sentence Grounding From a Tracking Perspective With Memory Network and Masked Attention

Z Xiong, D Liu, X Fang, X Qu, J Dong… - IEEE Transactions …, 2024 - ieeexplore.ieee.org
Video sentence grounding (VSG) is the task of identifying the segment of an untrimmed
video that semantically corresponds to a given natural language query. While many existing …

Filling the Information Gap between Video and Query for Language-Driven Moment Retrieval

D Liu, X Qu, J Dong, G Nan, P Zhou, Z Xu… - Proceedings of the 31st …, 2023 - dl.acm.org
This paper addresses the challenging task of language-driven moment retrieval. Previous
methods are typically trained to localize the target moment corresponding to a single …

Transform-Equivariant Consistency Learning for Temporal Sentence Grounding

D Liu, X Qu, J Dong, P Zhou, Z Xu, H Wang… - ACM Transactions on …, 2024 - dl.acm.org
This paper addresses temporal sentence grounding (TSG). Although existing methods
have made decent achievements in this task, they not only severely rely on abundant video …

Probability distribution based frame-supervised language-driven action localization

S Yang, Z Shang, X Wu - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
Frame-supervised language-driven action localization aims to localize action boundaries in
untrimmed videos corresponding to the input natural language query, with only a single …

Efficient Language-Driven Action Localization by Feature Aggregation and Prediction Adjustment

Z Shang, S Yang, X Wu - Chinese Conference on Pattern Recognition and …, 2024 - Springer
Language-driven action localization is a challenging task that aims to identify action
boundaries, namely the start and end timestamps, within untrimmed videos using natural …