Authors
Jiading Fang, Xiangshan Tan, Shengjie Lin, Hongyuan Mei, Matthew Walter
Publication date
2023/10/21
Workshop paper
2nd Workshop on Language and Robot Learning: Language as Grounding
Description
If robots are to work effectively alongside people, they must be able to interpret natural language references to objects in their 3D environment. Understanding 3D referring expressions is challenging: it requires the ability both to parse the 3D structure of the scene and to correctly ground free-form language in the presence of distraction and clutter. We propose Transcribe3D, a simple yet effective approach to interpreting 3D referring expressions, which converts 3D scene geometry into a textual representation and takes advantage of the common-sense reasoning capability of large language models (LLMs) to make inferences about the objects in the scene and their interactions. We experimentally demonstrate that employing LLMs in this zero-shot fashion outperforms contemporary methods. We then improve upon the zero-shot version of Transcribe3D by finetuning with self-correction to improve generalization. We show preliminary results on the Referit3D dataset with state-of-the-art performance. We also show that our method enables real robots to perform pick-and-place tasks given queries that contain challenging referring expressions.
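To make the core idea concrete, below is a minimal sketch of the transcription step the abstract describes: turning detected 3D objects into a textual scene description that an LLM can reason over. The object schema, prompt wording, and `scene_to_text` helper are illustrative assumptions, not the authors' actual format.

```python
# Hypothetical sketch: serialize 3D scene geometry (object labels and box
# centers) into plain text, then append a referring-expression query for an LLM.

def scene_to_text(objects):
    """Render each detected object as one line: id, label, and 3D center (meters)."""
    lines = []
    for obj in objects:
        x, y, z = obj["center"]
        lines.append(f'object {obj["id"]}: {obj["label"]} at ({x:.2f}, {y:.2f}, {z:.2f})')
    return "\n".join(lines)

# Toy scene with two chairs and a table (made-up coordinates).
scene = [
    {"id": 0, "label": "chair", "center": (1.0, 0.5, 0.0)},
    {"id": 1, "label": "chair", "center": (3.2, 0.5, 0.0)},
    {"id": 2, "label": "table", "center": (1.4, 0.5, 0.0)},
]

# The resulting prompt combines the textual scene with the referring expression;
# the LLM's common-sense reasoning resolves which object id is meant.
prompt = (
    scene_to_text(scene)
    + "\nReferring expression: the chair closest to the table."
    + "\nWhich object id is being referred to?"
)
print(prompt)
```

The key design point is that once geometry is text, grounding a referring expression reduces to question answering, which LLMs can attempt zero-shot.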