Generation and comprehension of unambiguous object descriptions

S Tellex, N Gopalan, H Kress-Gazit… - Annual Review of …, 2020 - annualreviews.org

This article surveys the use of natural language in robotics from a robotics point of view. To
use human language, robots must map words to aspects of the physical world, mediated by …

Cited by 245 · Related articles · All 5 versions

12-in-1: Multi-task vision and language representation learning

J Lu, V Goswami, M Rohrbach… - Proceedings of the …, 2020 - openaccess.thecvf.com

Much of vision-and-language research focuses on a small but diverse set of independent
tasks and supporting datasets often studied in isolation; however, the visually-grounded …

Cited by 514 · Related articles · All 7 versions

Improving one-stage visual grounding by recursive sub-query construction

Z Yang, T Chen, L Wang, J Luo - … Conference, Glasgow, UK, August 23–28 …, 2020 - Springer

We improve one-stage visual grounding by addressing current limitations on grounding long
and complex queries. Existing one-stage methods encode the entire language query as a …

Cited by 204 · Related articles · All 9 versions

Multi-task collaborative network for joint referring expression comprehension and segmentation

G Luo, Y Zhou, X Sun, L Cao, C Wu… - Proceedings of the …, 2020 - openaccess.thecvf.com

Referring expression comprehension (REC) and segmentation (RES) are two highly-related
tasks, which both aim at identifying the referent according to a natural language expression …

Cited by 256 · Related articles · All 8 versions

ScanRefer: 3D object localization in RGB-D scans using natural language

DZ Chen, AX Chang, M Nießner - European conference on computer …, 2020 - Springer

We introduce the task of 3D object localization in RGB-D scans using natural language
descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free …

Cited by 237 · Related articles · All 5 versions

REVERIE: Remote embodied visual referring expression in real indoor environments

Y Qi, Q Wu, P Anderson, X Wang… - Proceedings of the …, 2020 - openaccess.thecvf.com

One of the long-term challenges of robotics is to enable robots to interact with humans in the
visual world via natural language, as humans are visual animals that communicate through …

Cited by 271 · Related articles · All 10 versions

Connecting vision and language with localized narratives

J Pont-Tuset, J Uijlings, S Changpinyo… - Computer Vision–ECCV …, 2020 - Springer

We propose Localized Narratives, a new form of multimodal image annotations
connecting vision and language. We ask annotators to describe an image with their voice …

Cited by 210 · Related articles · All 7 versions

Linguistic structure guided context modeling for referring image segmentation

T Hui, S Liu, S Huang, G Li, S Yu, F Zhang… - Computer Vision–ECCV …, 2020 - Springer

Referring image segmentation aims to predict the foreground mask of the object referred by
a natural language sentence. Multimodal context of the sentence is crucial to distinguish the …

Cited by 127 · Related articles · All 9 versions

A real-time cross-modality correlation filtering method for referring expression comprehension

Y Liao, S Liu, G Li, F Wang, Y Chen… - Proceedings of the …, 2020 - openaccess.thecvf.com

Referring expression comprehension aims to localize the object instance described by a
natural language expression. Current referring expression methods have achieved good …

Cited by 181 · Related articles · All 13 versions

Referring image segmentation via cross-modal progressive comprehension

S Huang, T Hui, S Liu, G Li, Y Wei… - Proceedings of the …, 2020 - openaccess.thecvf.com

Referring image segmentation aims at segmenting the foreground masks of the entities that
can well match the description given in the natural language expression. Previous …

Cited by 161 · Related articles · All 13 versions
