Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs

Authors

Jialou Wang, Manli Zhu, Yulei Li, Honglei Li, Longzhi Yang, Wai Lok Woo

Publication date

2024/4/3

Journal

IEEE Intelligent Systems

Publisher

IEEE

Description

Localization plays a crucial role in enhancing the practicality and precision of VQA systems. By enabling fine-grained identification of, and interaction with, specific parts of an object, it significantly improves the system’s ability to provide contextually relevant and spatially accurate responses, which is essential for applications in dynamic environments such as robotics and augmented reality. However, traditional systems struggle to map objects within images accurately enough to generate nuanced, spatially aware responses. In this work, we introduce “Detect2Interact”, an approach for fine-grained detection of object visual key fields that addresses these challenges. First, we use the Segment Anything Model (SAM) to generate detailed spatial maps of objects in images. Next, we use Vision Studio to extract semantic object descriptions. Third, we employ GPT-4’s common sense knowledge, bridging the …
