Authors
Fangzhou Mu, Sicheng Mo, Yin Li
Publication date
2024
Conference
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Pages
18930-18940
Description
Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability: they have been optimized for grounding only a few text queries within short videos and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5× faster than CONE, a state of the art for long-form video grounding, on the challenging MAD dataset, while achieving highly competitive results on short videos.
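To make the late-fusion argument concrete, below is a minimal, hypothetical sketch (not the authors' implementation) using NumPy stand-ins for the video and text encoders; all dimensions, function names, and the dot-product fusion step are illustrative assumptions. The point it shows is that with late fusion the expensive, query-agnostic video encoding is computed once per video and reused across all text queries, whereas an early-fusion model would re-process the video jointly with every query.

```python
import numpy as np

# Illustrative sizes only; not taken from the paper.
D = 128            # shared embedding dimension
T = 4096           # number of clips in a long-form video
NUM_QUERIES = 300  # many text queries grounded in the same video

rng = np.random.default_rng(0)
W_video = rng.standard_normal((256, D)) / np.sqrt(256)
W_text = rng.standard_normal((64, D)) / np.sqrt(64)

def encode_video(clips):
    """Stand-in for a query-agnostic video encoder (run once per video)."""
    return clips @ W_video

def encode_text(queries):
    """Stand-in for a text encoder (run once per query)."""
    return queries @ W_text

clips = rng.standard_normal((T, 256))            # raw clip features
queries = rng.standard_normal((NUM_QUERIES, 64))  # raw query features

# Late fusion: video encoding cost is amortized over all queries;
# the per-query cost is only the lightweight fusion step.
video_feats = encode_video(clips)       # heavy, done once
text_feats = encode_text(queries)       # light, once per query
scores = text_feats @ video_feats.T     # (NUM_QUERIES, T) cheap cross-modal fusion

# Early fusion would instead require NUM_QUERIES full passes over all T clips.
print(scores.shape)  # (300, 4096)
```

The same asymmetry motivates a video-centric view of training: sampling one video together with many of its queries amortizes the video encoding cost across those queries, rather than re-encoding the video for each query-centric training sample.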