Generation and comprehension of unambiguous object descriptions

Q Wu, D Teney, P Wang, C Shen, A Dick… - Computer Vision and …, 2017 - Elsevier

Visual Question Answering (VQA) is a challenging task that has received increasing
attention from both the computer vision and the natural language processing communities …

Cited by 454 · Related articles · All 6 versions

[PDF] thecvf.com

Localizing moments in video with natural language

L Anne Hendricks, O Wang… - Proceedings of the …, 2017 - openaccess.thecvf.com

We consider retrieving a specific temporal segment, or moment, from a video given a natural
language text description. Methods designed to retrieve whole video clips with natural …

Cited by 919 · Related articles · All 10 versions

[PDF] thecvf.com

TALL: Temporal activity localization via language query

J Gao, C Sun, Z Yang… - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com

This paper focuses on temporal localization of actions from untrimmed videos. Existing
methods typically involve training classifiers for a pre-defined list of actions and applying the …

Cited by 764 · Related articles · All 8 versions

[PDF] thecvf.com

GuessWhat?! Visual object discovery through multi-modal dialogue

H De Vries, F Strub, S Chandar… - Proceedings of the …, 2017 - openaccess.thecvf.com

We introduce GuessWhat?!, a two-player guessing game as a testbed for research on the
interplay of computer vision and dialogue systems. The goal of the game is to locate an …

Cited by 457 · Related articles · All 13 versions

[PDF] thecvf.com

Modeling relationships in referential expressions with compositional modular networks

R Hu, M Rohrbach, J Andreas… - Proceedings of the …, 2017 - openaccess.thecvf.com

People often refer to entities in an image in terms of their relationships with other entities. For
example, "the black cat sitting under the table" refers to both a "black cat" entity and its …

Cited by 415 · Related articles · All 12 versions

[PDF] thecvf.com

A joint speaker-listener-reinforcer model for referring expressions

L Yu, H Tan, M Bansal, TL Berg - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com

Referring expressions are natural language constructions used to identify particular objects
within a scene. In this paper, we propose a unified framework for the tasks of referring …

Cited by 294 · Related articles · All 12 versions

[PDF] thecvf.com

Recurrent multimodal interaction for referring image segmentation

C Liu, Z Lin, X Shen, J Yang, X Lu… - Proceedings of the …, 2017 - openaccess.thecvf.com

In this paper we are interested in the problem of image segmentation given natural
language descriptions, i.e. referring expressions. Existing works tackle this problem by first …

Cited by 249 · Related articles · All 9 versions

[PDF] aaai.org

Attention correctness in neural image captioning

C Liu, J Mao, F Sha, A Yuille - Proceedings of the AAAI conference on …, 2017 - ojs.aaai.org

Attention mechanisms have recently been introduced in deep learning for various tasks in
natural language processing and computer vision. But despite their popularity …

Cited by 282 · Related articles · All 13 versions

[PDF] thecvf.com

Weakly-supervised learning of visual relations

J Peyre, J Sivic, I Laptev… - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com

This paper introduces a novel approach for modeling visual relations between pairs of
objects. We call relation a triplet of the form (subject, predicate, object) where the predicate …

Cited by 230 · Related articles · All 13 versions

[PDF] neurips.cc

Contrastive learning for image captioning

B Dai, D Lin - Advances in Neural Information Processing …, 2017 - proceedings.neurips.cc

Image captioning, a popular topic in computer vision, has achieved substantial progress in
recent years. However, the distinctiveness of natural descriptions is often overlooked in …

Cited by 204 · Related articles · All 6 versions
