Improving video captioning with temporal composition of a visual-syntactic embedding

J Perez-Martin, B Bustos… - Proceedings of the IEEE/CVF Winter Conference on Applications …, 2021 - openaccess.thecvf.com
Abstract
Video captioning is the task of predicting a semantically and syntactically correct sequence of words given some context video. The most successful methods for video captioning depend strongly on the effectiveness of semantic representations learned from visual models, but often produce syntactically incorrect sentences, which harms their performance on standard datasets. In this paper, we address this limitation by considering syntactic representation learning as an essential component of video captioning. We construct a visual-syntactic embedding by mapping into a common vector space a visual representation that depends only on the video and a syntactic representation that depends only on Part-of-Speech (POS) tagging structures of the video description. We integrate this joint representation into an encoder-decoder architecture that we call Visual-Semantic-Syntactic Aligned Network (SemSynAN), which guides the decoder (text generation stage) by aligning temporal compositions of visual, semantic, and syntactic representations. We tested our proposed architecture, obtaining state-of-the-art results on two widely used video captioning datasets: the Microsoft Video Description (MSVD) dataset and the Microsoft Research Video-to-Text (MSR-VTT) dataset.
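The core idea of the visual-syntactic embedding described above can be sketched as two projections into a shared vector space, where similarity between a video's visual features and the POS-tag representation of its description can be scored. The following is a minimal illustrative sketch, not the paper's implementation: the dimensions, the random (untrained) projection matrices, and the `embed`/`similarity` helpers are all assumptions for illustration; in the paper these mappings are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the abstract does not specify them.
VISUAL_DIM, POS_DIM, JOINT_DIM = 2048, 64, 512

# Random matrices standing in for the learned projections.
W_visual = rng.standard_normal((VISUAL_DIM, JOINT_DIM)) / np.sqrt(VISUAL_DIM)
W_syntax = rng.standard_normal((POS_DIM, JOINT_DIM)) / np.sqrt(POS_DIM)

def embed(features, W):
    """Project a feature vector into the joint space and L2-normalize it."""
    z = features @ W
    return z / np.linalg.norm(z)

def similarity(visual_feat, pos_feat):
    """Cosine similarity between the two views in the common space."""
    return float(embed(visual_feat, W_visual) @ embed(pos_feat, W_syntax))

video = rng.standard_normal(VISUAL_DIM)   # e.g. pooled visual-model features
pos_tags = rng.standard_normal(POS_DIM)   # e.g. an encoded POS-tag sequence

score = similarity(video, pos_tags)
```

In a trained system, matching video/description pairs would be pulled toward high similarity in this space, and the resulting joint representation would then condition the caption decoder.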