Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's …
We consider the problem of referring image segmentation. Given an input image and a natural language expression, the goal is to segment the object referred by the language …
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to …
H Chen, A Suhr, D Misra… - Proceedings of the …, 2019 - openaccess.thecvf.com
We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent …
In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image. For …
We propose a simple, fast, and accurate one-stage approach to visual grounding, inspired by the following insight. The performances of existing propose-and-rank two-stage methods …
P Wang, Q Wu, J Cao, C Shen… - Proceedings of the …, 2019 - openaccess.thecvf.com
The task in referring expression comprehension is to localize the object instance in an image described by a referring expression phrased in natural language. As a language-to …
Most scene graph parsers use a two-stage pipeline to detect visual relationships: the first stage detects entities, and the second predicts the predicate for each entity pair using a …
S Yang, G Li, Y Yu - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
Referring expression comprehension aims to locate the object instance described by a natural language referring expression in an image. This task is compositional and inherently …