State of the art on diffusion models for visual computing

R Po, W Yifan, V Golyanik, K Aberman… - Computer Graphics …, 2024 - Wiley Online Library
The field of visual computing is rapidly advancing due to the emergence of generative
artificial intelligence (AI), which unlocks unprecedented capabilities for the generation …

Scalable 3D captioning with pretrained models

T Luo, C Rockwell, H Lee… - Advances in Neural …, 2024 - proceedings.neurips.cc
We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects.
This approach utilizes pretrained models from image captioning, image-text alignment, and …

ULIP-2: Towards scalable multimodal pre-training for 3D understanding

L Xue, N Yu, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advancements in multimodal pre-training have shown promising efficacy in 3D
representation learning by aligning multimodal features across 3D shapes, their 2D …

OpenShape: Scaling up 3D shape representation towards open-world understanding

M Liu, R Shi, K Kuang, Y Zhu, X Li… - Advances in neural …, 2024 - proceedings.neurips.cc
We introduce OpenShape, a method for learning multi-modal joint representations of text,
image, and point clouds. We adopt the commonly used multi-modal contrastive learning …

Towards open vocabulary learning: A survey

J Wu, X Li, S Xu, H Yuan, H Ding… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …

Distilling large vision-language model with out-of-distribution generalizability

X Li, Y Fang, M Liu, Z Ling, Z Tu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Large vision-language models have achieved outstanding performance, but their size and
computational requirements make their deployment on resource-constrained devices and …

Michelangelo: Conditional 3D shape generation based on shape-image-text aligned latent representation

Z Zhao, W Liu, X Chen, X Zeng… - Advances in …, 2024 - proceedings.neurips.cc
We present a novel alignment-before-generation approach to tackle the challenging task of
generating general 3D shapes based on 2D images or texts. Directly learning a conditional …

Multi3DRefer: Grounding text description to multiple 3D objects

Y Zhang, ZM Gong, AX Chang - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We introduce the task of localizing a flexible number of objects in real-world 3D scenes
using natural language descriptions. Existing 3D visual grounding tasks focus on localizing …

PointGPT: Auto-regressively generative pre-training from point clouds

G Chen, M Wang, Y Yang, K Yu… - Advances in Neural …, 2024 - proceedings.neurips.cc
Large language models (LLMs) based on the generative pre-training transformer (GPT)
have demonstrated remarkable effectiveness across a diverse range of downstream tasks …

Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …