A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

From recognition to cognition: Visual commonsense reasoning

R Zellers, Y Bisk, A Farhadi… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Visual understanding goes well beyond object recognition. With one glance at an image, we
can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …

Cross-modal self-attention network for referring image segmentation

L Ye, M Rochan, Z Liu, Y Wang - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
We consider the problem of referring image segmentation. Given an input image and a
natural language expression, the goal is to segment the object referred to by the language …

Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation

X Wang, Q Huang, A Celikyilmaz… - Proceedings of the …, 2019 - openaccess.thecvf.com
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out
natural language instructions inside real 3D environments. In this paper, we study how to …

Touchdown: Natural language navigation and spatial reasoning in visual street environments

H Chen, A Suhr, D Misra… - Proceedings of the …, 2019 - openaccess.thecvf.com
We study the problem of jointly reasoning about language and vision through a navigation
and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent …

Composing text and image for image retrieval-an empirical odyssey

N Vo, L Jiang, C Sun, K Murphy, LJ Li… - Proceedings of the …, 2019 - openaccess.thecvf.com
In this paper, we study the task of image retrieval, where the input query is specified in the
form of an image plus some text that describes desired modifications to the input image. For …
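The query format this snippet describes, a reference image plus modification text, can be illustrated with a minimal retrieval sketch. The additive fusion and cosine-similarity ranking below are a generic stand-in under assumed fixed-size embeddings, not the paper's actual composition module; all function names are hypothetical.

```python
# Minimal sketch of image+text query composition for retrieval.
# The fusion here is a simple sum; real systems learn this combination.
import numpy as np

def compose_query(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    q = img_emb + txt_emb              # hypothetical fusion of the two modalities
    return q / np.linalg.norm(q)

def retrieve(query: np.ndarray, gallery: np.ndarray, k: int = 3) -> np.ndarray:
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = gallery @ query           # cosine similarity to each gallery image
    return np.argsort(-scores)[:k]     # indices of the k best matches

rng = np.random.default_rng(0)
img_emb, txt_emb = rng.normal(size=128), rng.normal(size=128)
gallery = rng.normal(size=(1000, 128))
print(retrieve(compose_query(img_emb, txt_emb), gallery))
```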

A fast and accurate one-stage approach to visual grounding

Z Yang, B Gong, L Wang, W Huang… - Proceedings of the …, 2019 - openaccess.thecvf.com
We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired
by the following insight. The performance of existing propose-and-rank two-stage methods …
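The two-stage propose-and-rank baseline that this snippet contrasts against can be sketched as: generate region proposals, score each proposal against the referring expression, and return the best-scoring box. Both stages below are stubs for illustration only and do not reflect the paper's one-stage model.

```python
# Generic propose-and-rank grounding sketch (not the one-stage method itself).
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def propose_regions(image) -> List[Box]:
    # Stage 1: a proposal network would run here; stubbed with fixed boxes.
    return [(0, 0, 50, 50), (30, 10, 120, 90), (200, 40, 300, 160)]

def match_score(expression: str, box: Box) -> float:
    # Stage 2: a cross-modal matcher would score each (expression, box) pair;
    # stubbed with a toy heuristic that prefers wider boxes for "wide".
    width = box[2] - box[0]
    return width if "wide" in expression else -width

def ground(image, expression: str) -> Box:
    return max(propose_regions(image), key=lambda b: match_score(expression, b))

print(ground(image=None, expression="the wide table on the right"))
```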

Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks

P Wang, Q Wu, J Cao, C Shen… - Proceedings of the …, 2019 - openaccess.thecvf.com
The task in referring expression comprehension is to localize the object instance in an
image described by a referring expression phrased in natural language. As a language-to …

Graphical contrastive losses for scene graph parsing

J Zhang, KJ Shih, A Elgammal, A Tao… - Proceedings of the …, 2019 - openaccess.thecvf.com
Most scene graph parsers use a two-stage pipeline to detect visual relationships: the first
stage detects entities, and the second predicts the predicate for each entity pair using a …
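The two-stage pipeline described in this snippet (detect entities, then predict a predicate for each entity pair) can be sketched as below. The detector and relation classifier are stubbed with toy outputs; the names and heuristics are hypothetical and are not the paper's code or its contrastive losses.

```python
# Sketch of a generic two-stage scene graph parser: entities first, predicates second.
from dataclasses import dataclass
from itertools import permutations
from typing import List, Tuple

@dataclass
class Entity:
    label: str
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def detect_entities(image) -> List[Entity]:
    # Stage 1: an object detector would run here; stubbed with fixed detections.
    return [Entity("person", (10, 20, 110, 220)),
            Entity("horse", (120, 40, 320, 230))]

def score_predicates(subj: Entity, obj: Entity) -> List[Tuple[str, float]]:
    # Stage 2: a relation classifier would score predicates from pairwise
    # visual/spatial features; stubbed with a toy spatial heuristic.
    dx = obj.box[0] - subj.box[0]
    return [("riding", 0.7 if dx > 0 else 0.1), ("near", 0.5)]

def parse_scene_graph(image):
    entities = detect_entities(image)
    triples = []
    for subj, obj in permutations(entities, 2):   # every ordered entity pair
        predicate, score = max(score_predicates(subj, obj), key=lambda p: p[1])
        triples.append((subj.label, predicate, obj.label, score))
    return triples

print(parse_scene_graph(image=None))
```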

Dynamic graph attention for referring expression comprehension

S Yang, G Li, Y Yu - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
Referring expression comprehension aims to locate the object instance described by a
natural language referring expression in an image. This task is compositional and inherently …