This paper describes our bronze-medal solution to the video captioning task of the ACMMM2021 Pre-Training for Video Understanding Challenge. We start from the Bottom-Up-Top-Down model and introduce technical improvements to both video content encoding and caption decoding. For encoding, we propose to extract multi-level video features that describe holistic scenes and fine-grained key objects, respectively. The scene-level and object-level features are enhanced separately by multi-head self-attention mechanisms before being fed into the decoding module. To generate content-relevant and human-like captions, we train our network end-to-end with semantic-reinforced learning. Finally, to select the best caption from the captions produced by distinct models, we perform caption reranking by cross-modal matching between a given video and each candidate caption. Both internal experiments on the MSR-VTT test set and the external evaluation by the challenge organizers confirm the viability of the proposed solution.
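
The sketch below illustrates, in a minimal PyTorch-style form, the idea of enhancing the scene-level and object-level feature streams with separate multi-head self-attention modules before decoding. It is not the authors' implementation; the class name `FeatureEnhancer`, the feature dimension, head count, and sequence lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureEnhancer(nn.Module):
    """Applies an independent multi-head self-attention block to each feature stream."""
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        # Separate attention modules so scene and object features are enhanced independently.
        self.scene_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.object_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, scene_feats, object_feats):
        # scene_feats:  (batch, num_frames,  dim) -- holistic scene-level features
        # object_feats: (batch, num_objects, dim) -- fine-grained object-level features
        scene_enh, _ = self.scene_attn(scene_feats, scene_feats, scene_feats)
        object_enh, _ = self.object_attn(object_feats, object_feats, object_feats)
        return scene_enh, object_enh

# Example usage with dummy inputs (sizes are assumptions for illustration)
enhancer = FeatureEnhancer()
scene = torch.randn(2, 20, 1024)    # e.g., 20 sampled frames per video
objects = torch.randn(2, 36, 1024)  # e.g., 36 detected key objects per video
scene_out, obj_out = enhancer(scene, objects)  # both streams fed to the caption decoder
```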