A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

From recognition to cognition: Visual commonsense reasoning

R Zellers, Y Bisk, A Farhadi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Visual understanding goes well beyond object recognition. With one glance at an image, we
can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …

Cross-modal self-attention network for referring image segmentation

L Ye, M Rochan, Z Liu, Y Wang - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
We consider the problem of referring image segmentation. Given an input image and a
natural language expression, the goal is to segment the object referred to by the language …

Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation

X Wang, Q Huang, A Celikyilmaz… - Proceedings of the …, 2019 - openaccess.thecvf.com
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out
natural language instructions inside real 3D environments. In this paper, we study how to …

Touchdown: Natural language navigation and spatial reasoning in visual street environments

H Chen, A Suhr, D Misra… - Proceedings of the …, 2019 - openaccess.thecvf.com
We study the problem of jointly reasoning about language and vision through a navigation
and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent …

Composing text and image for image retrieval-an empirical odyssey

N Vo, L Jiang, C Sun, K Murphy, LJ Li… - Proceedings of the …, 2019 - openaccess.thecvf.com
In this paper, we study the task of image retrieval, where the input query is specified in the
form of an image plus some text that describes desired modifications to the input image. For …
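The query format this snippet describes, a reference image plus modification text, can be illustrated with a minimal retrieval sketch. The additive fusion and cosine-similarity ranking below are a generic stand-in under assumed fixed-size embeddings, not the paper's actual composition module; all function names are hypothetical.

```python
# Minimal sketch of image+text query composition for retrieval.
# The fusion here is a simple sum; real systems learn this combination.
import numpy as np

def compose_query(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    q = img_emb + txt_emb              # hypothetical fusion of the two modalities
    return q / np.linalg.norm(q)

def retrieve(query: np.ndarray, gallery: np.ndarray, k: int = 3) -> np.ndarray:
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = gallery @ query           # cosine similarity to each gallery image
    return np.argsort(-scores)[:k]     # indices of the k best matches

rng = np.random.default_rng(0)
img_emb, txt_emb = rng.normal(size=128), rng.normal(size=128)
gallery = rng.normal(size=(1000, 128))
print(retrieve(compose_query(img_emb, txt_emb), gallery))
```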

A fast and accurate one-stage approach to visual grounding

Z Yang, B Gong, L Wang, W Huang… - Proceedings of the …, 2019 - openaccess.thecvf.com
We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired
by the following insight. The performance of existing propose-and-rank two-stage methods …
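The two-stage propose-and-rank baseline that this snippet contrasts against can be sketched as: generate region proposals, score each proposal against the referring expression, and return the best-scoring box. Both stages below are stubs for illustration only and do not reflect the paper's one-stage model.

```python
# Generic propose-and-rank grounding sketch (not the one-stage method itself).
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def propose_regions(image) -> List[Box]:
    # Stage 1: a proposal network would run here; stubbed with fixed boxes.
    return [(0, 0, 50, 50), (30, 10, 120, 90), (200, 40, 300, 160)]

def match_score(expression: str, box: Box) -> float:
    # Stage 2: a cross-modal matcher would score each (expression, box) pair;
    # stubbed with a toy heuristic that prefers wider boxes for "wide".
    width = box[2] - box[0]
    return width if "wide" in expression else -width

def ground(image, expression: str) -> Box:
    return max(propose_regions(image), key=lambda b: match_score(expression, b))

print(ground(image=None, expression="the wide table on the right"))
```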

Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks

P Wang, Q Wu, J Cao, C Shen… - Proceedings of the …, 2019 - openaccess.thecvf.com
The task in referring expression comprehension is to localize the object instance in an
image described by a referring expression phrased in natural language. As a language-to …

Graphical contrastive losses for scene graph parsing

J Zhang, KJ Shih, A Elgammal, A Tao… - Proceedings of the …, 2019 - openaccess.thecvf.com
Most scene graph parsers use a two-stage pipeline to detect visual relationships: the first
stage detects entities, and the second predicts the predicate for each entity pair using a …
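The two-stage pipeline described in this snippet (detect entities, then predict a predicate for each entity pair) can be sketched as below. The detector and relation classifier are stubbed with toy outputs; the names and heuristics are hypothetical and are not the paper's code or its contrastive losses.

```python
# Sketch of a generic two-stage scene graph parser: entities first, predicates second.
from dataclasses import dataclass
from itertools import permutations
from typing import List, Tuple

@dataclass
class Entity:
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def detect_entities(image) -> List[Entity]:
    # Stage 1: an object detector would run here; stubbed with fixed detections.
    return [Entity("person", (10, 20, 110, 220)),
            Entity("horse", (120, 40, 320, 230))]

def score_predicates(subj: Entity, obj: Entity) -> List[Tuple[str, float]]:
    # Stage 2: a relation classifier would score predicates from pairwise
    # visual/spatial features; stubbed with a toy spatial heuristic.
    dx = obj.box[0] - subj.box[0]
    return [("riding", 0.7 if dx > 0 else 0.1), ("near", 0.5)]

def parse_scene_graph(image):
    entities = detect_entities(image)
    triples = []
    for subj, obj in permutations(entities, 2):   # every ordered entity pair
        predicate, score = max(score_predicates(subj, obj), key=lambda p: p[1])
        triples.append((subj.label, predicate, obj.label, score))
    return triples

print(parse_scene_graph(image=None))
```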

Dynamic graph attention for referring expression comprehension

S Yang, G Li, Y Yu - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
Referring expression comprehension aims to locate the object instance described by a
natural language referring expression in an image. This task is compositional and inherently …