To create what you tell: Generating videos from captions

Y Pan, Z Qiu, T Yao, H Li, T Mei - Proceedings of the 25th ACM …, 2017 - dl.acm.org
We are creating multimedia content every day and everywhere. While automatic content
generation has posed a fundamental challenge to the multimedia community for decades …

Video captioning by adversarial LSTM

Y Yang, J Zhou, J Ai, Y Bin, A Hanjalic… - … on Image Processing, 2018 - ieeexplore.ieee.org
In this paper, we propose a novel approach to video captioning based on adversarial
learning and long short-term memory (LSTM). With this solution, we aim to …
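
A minimal sketch can make the adversarial setup concrete: an LSTM generator decodes a caption from a video feature while an LSTM discriminator judges whether a caption fits that video. Module names and sizes below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

VOCAB, EMBED, HIDDEN, FEAT = 1000, 256, 512, 2048  # assumed sizes

class CaptionGenerator(nn.Module):
    """LSTM decoder that emits a caption conditioned on a video feature."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.init_h = nn.Linear(FEAT, HIDDEN)  # video feature -> initial state
        self.lstm = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, video_feat, tokens):
        h0 = torch.tanh(self.init_h(video_feat)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        hs, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(hs)  # per-step vocabulary logits

class CaptionDiscriminator(nn.Module):
    """LSTM critic that scores whether a caption looks real for the video."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.lstm = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.score = nn.Linear(HIDDEN + FEAT, 1)

    def forward(self, video_feat, tokens):
        _, (h, _) = self.lstm(self.embed(tokens))
        return self.score(torch.cat([h[-1], video_feat], dim=-1))  # real/fake logit

G, D = CaptionGenerator(), CaptionDiscriminator()
video = torch.randn(4, FEAT)
real = torch.randint(0, VOCAB, (4, 12))
fake = G(video, real).argmax(-1)  # greedy decode as the "fake" caption
bce = nn.BCEWithLogitsLoss()
d_loss = bce(D(video, real), torch.ones(4, 1)) + bce(D(video, fake), torch.zeros(4, 1))
```

Note that the greedy argmax blocks gradients to the generator; training through discrete captions is the hard part in practice, and this sketch only shows the discriminator's side of the objective.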

StyleVideoGAN: A temporal generative model using a pretrained StyleGAN

G Fox, A Tewari, M Elgharib, C Theobalt - arXiv preprint arXiv:2107.07224, 2021 - arxiv.org
Generative adversarial networks (GANs) continue to produce advances in terms of the visual
quality of still images, as well as the learning of temporal correlations. However, few works …
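
The core idea, roughly: keep a pretrained image generator frozen and learn only a temporal model over its latent space. In the sketch below a single linear layer stands in for the pretrained StyleGAN; the LSTM design and sizes are assumptions.

```python
import torch
import torch.nn as nn

W_DIM, T = 512, 16  # assumed latent width and clip length

frozen_G = nn.Sequential(nn.Linear(W_DIM, 3 * 64 * 64))  # stand-in for StyleGAN
for p in frozen_G.parameters():
    p.requires_grad_(False)  # the image generator stays fixed

class LatentTrajectory(nn.Module):
    """Unrolls an LSTM from a noise seed into a sequence of latent codes."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(W_DIM, W_DIM, batch_first=True)
        self.head = nn.Linear(W_DIM, W_DIM)

    def forward(self, z):
        steps = z.unsqueeze(1).expand(-1, T, -1).contiguous()  # same seed each step
        hs, _ = self.lstm(steps)
        return self.head(hs)  # (batch, T, W_DIM) latent trajectory

traj = LatentTrajectory()
z = torch.randn(2, W_DIM)
ws = traj(z)
frames = frozen_G(ws).view(2, T, 3, 64, 64)  # one rendered frame per code
```

Only the trajectory model needs training, which is what makes this family of approaches cheap relative to training a video GAN from scratch.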

Adversarial inference for multi-sentence video description

JS Park, M Rohrbach, T Darrell… - Proceedings of the …, 2019 - openaccess.thecvf.com
While significant progress has been made in the image captioning task, video description is
still in its infancy due to the complex nature of video data. Generating multi-sentence …
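
In spirit, the adversarial component can also be used at inference time: sample several candidate descriptions and let a learned critic keep the most human-like one. Both helpers below are hypothetical placeholders for the paper's generator and hybrid discriminator.

```python
import random

def sample_description(video_id: str) -> str:
    # stand-in: a real system would sample from a captioning model
    return random.choice([
        "A man chops onions. He adds them to a pan.",
        "Someone cooks. The food is served.",
    ])

def critic_score(video_id: str, text: str) -> float:
    # stand-in: a real critic scores fluency, relevance, and coherence
    return float(len(text))

def adversarial_inference(video_id: str, n_samples: int = 5) -> str:
    """Sample candidates, keep the one the critic scores highest."""
    candidates = [sample_description(video_id) for _ in range(n_samples)]
    return max(candidates, key=lambda c: critic_score(video_id, c))

print(adversarial_inference("vid_001"))
```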

Temporal generative adversarial nets with singular value clipping

M Saito, E Matsumoto, S Saito - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com
In this paper, we propose a generative model, Temporal Generative Adversarial Nets
(TGAN), which can learn a semantic representation of unlabeled videos, and is capable of …
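
Singular value clipping itself is easy to state: after each update, project every weight matrix back onto the set of matrices whose spectral norm is at most 1, which keeps the WGAN-style critic roughly 1-Lipschitz. A minimal PyTorch sketch, not the authors' code:

```python
import torch

@torch.no_grad()
def clip_singular_values(module: torch.nn.Module, max_sv: float = 1.0):
    """Clamp the singular values of every 2D weight to at most max_sv."""
    for p in module.parameters():
        if p.dim() == 2:
            U, S, Vh = torch.linalg.svd(p, full_matrices=False)
            p.copy_(U @ torch.diag(S.clamp(max=max_sv)) @ Vh)

disc = torch.nn.Linear(128, 1)
clip_singular_values(disc)  # would be called after each optimizer step
print(torch.linalg.matrix_norm(disc.weight, ord=2) <= 1.0 + 1e-6)
```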

Panda-70M: Captioning 70M videos with multiple cross-modality teachers

TS Chen, A Siarohin, W Menapace… - Proceedings of the …, 2024 - openaccess.thecvf.com
The quality of the data and annotations upper-bounds the quality of a downstream model.
While large text corpora and image-text pairs exist, high-quality video-text data is much …
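
The selection step can be pictured as follows: several teacher captioners propose candidates for a clip, and a cross-modal similarity model keeps the best match. The embedding functions below are random stand-ins for a fine-tuned retrieval model.

```python
import numpy as np

def embed_video(clip_path: str) -> np.ndarray:
    # stand-in: a real system would use a video encoder
    return np.random.default_rng(hash(clip_path) % 2**32).standard_normal(512)

def embed_text(caption: str) -> np.ndarray:
    # stand-in: a real system would use the matching text encoder
    return np.random.default_rng(hash(caption) % 2**32).standard_normal(512)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_caption(clip_path: str, teacher_captions: list[str]) -> str:
    """Keep the teacher caption most similar to the clip embedding."""
    v = embed_video(clip_path)
    return max(teacher_captions, key=lambda c: cosine(v, embed_text(c)))

print(best_caption("clip.mp4", ["a dog runs", "two people talk", "a car drives"]))
```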

SBAT: Video captioning with sparse boundary-aware transformer

T Jin, S Huang, M Chen, Y Li, Z Zhang - arXiv preprint arXiv:2007.11888, 2020 - arxiv.org
In this paper, we focus on effectively applying the transformer architecture to video
captioning. The vanilla transformer was proposed for uni-modal language …
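
One plausible reading of "sparse boundary-aware" attention, sketched under the assumption that attention keys are restricted to frames whose features change sharply (likely scene boundaries); this is purely illustrative, not the paper's exact mechanism.

```python
import torch

def boundary_sparse_attention(q, frames, k_keep=4):
    """q: (d,) query; frames: (T, d) per-frame features."""
    diffs = (frames[1:] - frames[:-1]).norm(dim=-1)        # change between frames
    idx = diffs.topk(min(k_keep, diffs.numel())).indices + 1
    keys = frames[idx]                                     # attend only to boundary frames
    attn = torch.softmax(keys @ q / keys.shape[-1] ** 0.5, dim=0)
    return attn @ keys                                     # sparse attention output

out = boundary_sparse_attention(torch.randn(64), torch.randn(20, 64))
```

Restricting keys this way cuts the redundancy of near-duplicate frames, which is the usual motivation for sparsifying attention over video.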

End-to-end generative pretraining for multimodal video captioning

PH Seo, A Nagrani, A Arnab… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …
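
The bi-directional generative objective can be sketched as two cross-entropy terms: generate a future utterance from the video plus the present utterance, then swap the roles. The tiny GRU seq2seq below is a hypothetical stand-in for the paper's transformer.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Decodes a target utterance from a video feature plus a context utterance."""
    def __init__(self, vocab=1000, d=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.video_proj = nn.Linear(512, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, video, context, target):
        # prepend the projected video and the context utterance
        prefix = torch.cat([self.video_proj(video).unsqueeze(1),
                            self.embed(context)], dim=1)
        _, h = self.rnn(prefix)
        dec, _ = self.rnn(self.embed(target), h)
        return self.out(dec)

model, ce = TinySeq2Seq(), nn.CrossEntropyLoss()
video = torch.randn(2, 512)
present = torch.randint(0, 1000, (2, 8))
future = torch.randint(0, 1000, (2, 8))
fwd = ce(model(video, present, future).flatten(0, 1), future.flatten())
bwd = ce(model(video, future, present).flatten(0, 1), present.flatten())
loss = fwd + bwd  # forward and backward generation objectives
```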

EMScore: Evaluating video captioning via coarse-grained and fine-grained embedding matching

Y Shi, X Yang, H Xu, C Yuan, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Current metrics for video captioning are mostly based on text-level comparison between
reference and candidate captions. However, they have some insuperable drawbacks, e.g., …

End-to-end dense video captioning as sequence generation

W Zhu, B Pang, AV Thapliyal, WY Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Dense video captioning aims to identify the events of interest in an input video, and generate
descriptive captions for each event. Previous approaches usually follow a two-stage …
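
Casting the task as sequence generation amounts to serializing every event's timestamps and caption into one token stream, e.g., with discretized time tokens, so a single decoder can emit all events in order. The token format below is an assumption, not the paper's exact scheme.

```python
N_TIME_BINS = 100  # assumed temporal quantization

def time_token(t: float, duration: float) -> str:
    """Discretize a timestamp into one of N_TIME_BINS special tokens."""
    return f"<t{int(t / duration * (N_TIME_BINS - 1))}>"

def serialize_events(events, duration):
    """events: list of (start_sec, end_sec, caption) tuples."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [time_token(start, duration), time_token(end, duration), caption]
    return " ".join(parts)

print(serialize_events([(2.0, 8.5, "a man mixes batter"),
                        (9.0, 15.0, "he pours it into a pan")], duration=20.0))
```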