Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE Transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Video description: A survey of methods, datasets, and evaluation metrics

N Aafaq, A Mian, W Liu, SZ Gilani, M Shah - ACM Computing Surveys …, 2019 - dl.acm.org
Video description is the automatic generation of natural language sentences that describe
the contents of a given video. It has applications in human-robot interaction, helping the …

MSR-VTT: A large video description dataset for bridging video and language

J Xu, T Mei, T Yao, Y Rui - Proceedings of the IEEE …, 2016 - openaccess.thecvf.com
While there has been increasing interest in the task of describing video with natural
language, current computer vision algorithms are still severely limited in terms of the …

Deep visual-semantic alignments for generating image descriptions

A Karpathy, L Fei-Fei - Proceedings of the IEEE conference on …, 2015 - cv-foundation.org
We present a model that generates natural language descriptions of images and their
regions. Our approach leverages datasets of images and their sentence descriptions to …

Long-term recurrent convolutional networks for visual recognition and description

J Donahue, L Anne Hendricks… - Proceedings of the …, 2015 - openaccess.thecvf.com
Models comprised of deep convolutional network layers have dominated recent
image interpretation tasks; we investigate whether models which are also compositional, or …

MovieQA: Understanding stories in movies through question-answering

M Tapaswi, Y Zhu, R Stiefelhagen… - Proceedings of the …, 2016 - openaccess.thecvf.com
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension
from both video and text. The dataset consists of 14,944 questions about 408 movies with …

Describing videos by exploiting temporal structure

L Yao, A Torabi, K Cho, N Ballas… - Proceedings of the …, 2015 - openaccess.thecvf.com
Recent progress in using recurrent neural networks (RNNs) for image description has
motivated the exploration of their application for video description. However, while images …

Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning

N Aafaq, N Akhtar, W Liu, SZ Gilani… - Proceedings of the …, 2019 - openaccess.thecvf.com
Automatic generation of video captions is a fundamental challenge in computer vision.
Recent techniques typically employ a combination of Convolutional Neural Networks …

ReferItGame: Referring to objects in photographs of natural scenes

S Kazemzadeh, V Ordonez, M Matten… - Proceedings of the 2014 …, 2014 - aclanthology.org
In this paper we introduce a new game to crowd-source natural language referring
expressions. By designing a two player game, we can both collect and verify referring …

Automatic model construction with Gaussian processes

D Duvenaud - 2014 - repository.cam.ac.uk
This thesis develops a method for automatically constructing, visualizing and describing a
large class of models, useful for forecasting and finding structure in domains such as time …