H Zhen, X Qiu, P Chen, J Yang, X Yan, Y Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the
broader realm of the 3D physical world. Furthermore, they perform action prediction by …