Visual question answering: A survey of methods and datasets

Q Wu, D Teney, P Wang, C Shen, A Dick… - Computer Vision and …, 2017 - Elsevier
Visual Question Answering (VQA) is a challenging task that has received increasing
attention from both the computer vision and the natural language processing communities …

Localizing moments in video with natural language

L Anne Hendricks, O Wang… - Proceedings of the …, 2017 - openaccess.thecvf.com
We consider retrieving a specific temporal segment, or moment, from a video given a natural
language text description. Methods designed to retrieve whole video clips with natural …

TALL: Temporal activity localization via language query

J Gao, C Sun, Z Yang… - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com
This paper focuses on temporal localization of actions from untrimmed videos. Existing
methods typically involve training classifiers for a pre-defined list of actions and applying the …

GuessWhat?! visual object discovery through multi-modal dialogue

H De Vries, F Strub, S Chandar… - Proceedings of the …, 2017 - openaccess.thecvf.com
We introduce GuessWhat?!, a two-player guessing game as a testbed for research on the
interplay of computer vision and dialogue systems. The goal of the game is to locate an …

Modeling relationships in referential expressions with compositional modular networks

R Hu, M Rohrbach, J Andreas… - Proceedings of the …, 2017 - openaccess.thecvf.com
People often refer to entities in an image in terms of their relationships with other entities. For
example, "the black cat sitting under the table" refers to both a "black cat" entity and its …

A joint speaker-listener-reinforcer model for referring expressions

L Yu, H Tan, M Bansal, TL Berg - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com
Referring expressions are natural language constructions used to identify particular objects
within a scene. In this paper, we propose a unified framework for the tasks of referring …

Recurrent multimodal interaction for referring image segmentation

C Liu, Z Lin, X Shen, J Yang, X Lu… - Proceedings of the …, 2017 - openaccess.thecvf.com
In this paper we are interested in the problem of image segmentation given natural
language descriptions, i.e. referring expressions. Existing works tackle this problem by first …

Attention correctness in neural image captioning

C Liu, J Mao, F Sha, A Yuille - Proceedings of the AAAI conference on …, 2017 - ojs.aaai.org
Attention mechanisms have recently been introduced in deep learning for various tasks in
natural language processing and computer vision. But despite their popularity …

Weakly-supervised learning of visual relations

J Peyre, J Sivic, I Laptev… - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com
This paper introduces a novel approach for modeling visual relations between pairs of
objects. We call relation a triplet of the form (subject, predicate, object) where the predicate …

Contrastive learning for image captioning

B Dai, D Lin - Advances in Neural Information Processing …, 2017 - proceedings.neurips.cc
Image captioning, a popular topic in computer vision, has achieved substantial progress in
recent years. However, the distinctiveness of natural descriptions is often overlooked in …