Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Video description: A survey of methods, datasets, and evaluation metrics

N Aafaq, A Mian, W Liu, SZ Gilani, M Shah - ACM Computing Surveys …, 2019 - dl.acm.org
Video description is the automatic generation of natural language sentences that describe
the contents of a given video. It has applications in human-robot interaction, helping the …

MSR-VTT: A large video description dataset for bridging video and language

J Xu, T Mei, T Yao, Y Rui - Proceedings of the IEEE …, 2016 - openaccess.thecvf.com
While there has been increasing interest in the task of describing video with natural
language, current computer vision algorithms are still severely limited in terms of the …

Deep visual-semantic alignments for generating image descriptions

A Karpathy, L Fei-Fei - Proceedings of the IEEE conference on …, 2015 - cv-foundation.org
We present a model that generates natural language descriptions of images and their
regions. Our approach leverages datasets of images and their sentence descriptions to …

Long-term recurrent convolutional networks for visual recognition and description

J Donahue, L Anne Hendricks… - Proceedings of the …, 2015 - openaccess.thecvf.com
Models comprised of deep convolutional network layers have dominated recent
image interpretation tasks; we investigate whether models which are also compositional, or …

MovieQA: Understanding stories in movies through question-answering

M Tapaswi, Y Zhu, R Stiefelhagen… - Proceedings of the …, 2016 - openaccess.thecvf.com
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension
from both video and text. The dataset consists of 14,944 questions about 408 movies with …

Describing videos by exploiting temporal structure

L Yao, A Torabi, K Cho, N Ballas… - Proceedings of the …, 2015 - openaccess.thecvf.com
Recent progress in using recurrent neural networks (RNNs) for image description has
motivated the exploration of their application for video description. However, while images …

Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning

N Aafaq, N Akhtar, W Liu, SZ Gilani… - Proceedings of the …, 2019 - openaccess.thecvf.com
Automatic generation of video captions is a fundamental challenge in computer vision.
Recent techniques typically employ a combination of Convolutional Neural Networks …

ReferItGame: Referring to objects in photographs of natural scenes

S Kazemzadeh, V Ordonez, M Matten… - Proceedings of the 2014 …, 2014 - aclanthology.org
In this paper we introduce a new game to crowd-source natural language referring
expressions. By designing a two player game, we can both collect and verify referring …

Automatic model construction with Gaussian processes

D Duvenaud - 2014 - repository.cam.ac.uk
This thesis develops a method for automatically constructing, visualizing and describing a
large class of models, useful for forecasting and finding structure in domains such as time …