Aligning cyber space with physical world: A comprehensive survey on embodied AI

Y Liu, W Chen, Y Bai, J Luo, X Song, K Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General
Intelligence (AGI) and serves as a foundation for various applications that bridge cyberspace …

A survey of label-efficient deep learning for 3D point clouds

A Xiao, X Zhang, L Shao, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
In the past decade, deep neural networks have achieved significant progress in point cloud
learning. However, collecting large-scale precisely-annotated point clouds is extremely …

When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models

X Ma, Y Bhalgat, B Smart, S Chen, X Li, J Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs)
has seen rapid progress, offering unprecedented capabilities for understanding and …

A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions

D Liu, Y Liu, W Huang, W Hu - arXiv preprint arXiv:2406.05785, 2024 - arxiv.org
Text-guided 3D visual grounding (T-3DVG), which aims to locate a specific object that
semantically corresponds to a language query from a complicated 3D scene, has drawn …

VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

R Xu, Z Huang, T Wang, Y Chen, J Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
3D visual grounding is crucial for robots, requiring integration of natural language and 3D
scene understanding. Traditional methods depending on supervised learning with 3D point …

3D Visual Grounding-Audio: 3D scene object detection based on audio

C Zhang, Z Cai, X Chen, F Da, S Gai - Neurocomputing, 2025 - Elsevier
3D Visual Grounding (3DVG) is a prevalent multi-modal information fusion task
capable of accurately localizing target objects referenced in natural language descriptions …

Space3D-Bench: Spatial 3D Question Answering Benchmark

E Szymanska, M Dusmanu, JW Buurlage… - arXiv preprint arXiv …, 2024 - arxiv.org
Answering questions about the spatial properties of the environment poses challenges for
existing language and vision foundation models due to a lack of understanding of the 3D …

ViewInfer3D: 3D Visual Grounding based on Embodied Viewpoint Inference

L Geng, J Yin - IEEE Robotics and Automation Letters, 2024 - ieeexplore.ieee.org
3D Visual Grounding (3D VG) is a fundamental task in embodied intelligence, which entails
robots interpreting natural language descriptions to locate objects within 3D environments …

Automatic benchmarking of large multimodal models via iterative experiment programming

A Conti, E Fini, P Rota, Y Wang, M Mancini… - arXiv preprint arXiv …, 2024 - arxiv.org
Assessing the capabilities of large multimodal models (LMMs) often requires the creation of
ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of …

A Survey of Language-Grounded Multimodal 3D Scene Understanding

R Ren, X Zhao, W Xu, J Cao, X Xu, X Zhang - Available at SSRN 4992295 - papers.ssrn.com
As an emergent task bridging vision and language, Language-grounded Multimodal 3D
Scene Understanding (3D-LMSU) has attracted significant interest across various domains …