ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

W Ma, G Zeng, G Zhang, Q Liu, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
A vision model with general-purpose object-level 3D understanding should be capable of
inferring both 2D (e.g., class name and bounding box) and 3D information (e.g., 3D location …

Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering

X Wang, W Ma, A Wang, S Chen, A Kortylewski… - arXiv preprint arXiv …, 2024 - arxiv.org
For vision-language models (VLMs), understanding the dynamic properties of objects and
their interactions within 3D scenes from video is crucial for effective reasoning. In this work …

LCV2: A Universal Pretraining-Free Framework for Grounded Visual Question Answering

Y Chen, L Su, L Chen, Z Lin - Electronics, 2024 - mdpi.com
Grounded Visual Question Answering systems place heavy reliance on substantial
computational power and data resources in pretraining. In response to this challenge, this …