Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded …
We improve one-stage visual grounding by addressing current limitations on grounding long and complex queries. Existing one-stage methods encode the entire language query as a …
Referring expression comprehension (REC) and segmentation (RES) are two highly-related tasks, which both aim at identifying the referent according to a natural language expression …
We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free …
One of the long-term challenges of robotics is to enable robots to interact with humans in the visual world via natural language, as humans are visual animals that communicate through …
We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice …
T Hui, S Liu, S Huang, G Li, S Yu, F Zhang… - Computer Vision–ECCV …, 2020 - Springer
Referring image segmentation aims to predict the foreground mask of the object referred by a natural language sentence. Multimodal context of the sentence is crucial to distinguish the …
Y Liao, S Liu, G Li, F Wang, Y Chen… - Proceedings of the …, 2020 - openaccess.thecvf.com
Referring expression comprehension aims to localize the object instance described by a natural language expression. Current referring expression methods have achieved good …
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous …