R Ren, X Zhao, W Xu, J Cao, X Xu, X Zhang - Available at SSRN 4992295 - papers.ssrn.com
As an emergent task bridging vision and language, Language-grounded Multimodal 3D
Scene Understanding (3D-LMSU) has attracted significant interest across various domains …