Video description: A comprehensive survey of deep learning approaches

G Rafiq, M Rafiq, GS Choi - Artificial Intelligence Review, 2023 - Springer
Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …

Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk
Image captioning refers to automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …

AutoAD II: The sequel - who, when, and what in movie audio description

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
Audio Description (AD) is the task of generating descriptions of visual content, at suitable
time intervals, for the benefit of visually impaired audiences. For movies, this presents …

AutoAD: Movie description in context

T Han, M Bain, A Nagrani, G Varol… - Proceedings of the …, 2023 - openaccess.thecvf.com
The objective of this paper is an automatic Audio Description (AD) model that ingests movies
and outputs AD in text form. Generating high-quality movie AD is challenging due to the …

Weakly supervised temporal sentence grounding with Gaussian-based contrastive proposal learning

M Zheng, Y Huang, Q Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Temporal sentence grounding aims to detect the most salient moment corresponding to the
natural language query from untrimmed videos. As labeling the temporal boundaries is labor …

Are binary annotations sufficient? Video moment retrieval via hierarchical uncertainty-based active learning

W Ji, R Liang, Z Zheng, W Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent research on video moment retrieval has mostly focused on improving accuracy,
efficiency, and robustness, all of which largely rely on the …

Weakly supervised video moment localization with contrastive negative sample mining

M Zheng, Y Huang, Q Chen, Y Liu - … of the AAAI Conference on Artificial …, 2022 - ojs.aaai.org
Video moment localization aims at localizing the video segments which are most related to
the given free-form natural language query. The weakly supervised setting, where only …

Weakly supervised temporal sentence grounding with uncertainty-guided self-training

Y Huang, L Yang, Y Sato - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
The task of weakly supervised temporal sentence grounding aims at finding the
corresponding temporal moments of a language description in the video, given video …

A survey on temporal sentence grounding in videos

X Lan, Y Yuan, X Wang, Z Wang, W Zhu - ACM Transactions on …, 2023 - dl.acm.org
Temporal sentence grounding in videos (TSGV), which aims at localizing one target
segment from an untrimmed video with respect to a given sentence query, has drawn …