Visual tuning

BXB Yu, J Chang, H Wang, L Liu, S Wang… - ACM Computing …, 2023 - dl.acm.org
Fine-tuning visual models has been widely shown to achieve promising performance on many
downstream visual tasks. With the surprising development of pre-trained visual foundation …

Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders

R Zhang, L Wang, Y Qiao, P Gao… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Pre-training on numerous image data has become the de facto approach for learning robust 2D representations. In
contrast, due to expensive data processing, the paucity of 3D datasets severely hinders …

ULIP-2: Towards scalable multimodal pre-training for 3D understanding

L Xue, N Yu, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advancements in multimodal pre-training have shown promising efficacy in 3D
representation learning by aligning multimodal features across 3D shapes, their 2D …

Contrast with reconstruct: Contrastive 3D representation learning guided by generative pretraining

Z Qi, R Dong, G Fan, Z Ge, X Zhang… - … on Machine Learning, 2023 - proceedings.mlr.press
Mainstream 3D representation learning approaches are built upon contrastive or generative
modeling pretext tasks, which have yielded great improvements in performance on various downstream …

DreamLLM: Synergistic multimodal comprehension and creation

R Dong, C Han, Y Peng, Z Qi, Z Ge, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents DreamLLM, a learning framework that first achieves versatile
Multimodal Large Language Models (MLLMs) empowered with frequently overlooked …

PointGPT: Auto-regressively generative pre-training from point clouds

G Chen, M Wang, Y Yang, K Yu… - Advances in Neural …, 2024 - proceedings.neurips.cc
Large language models (LLMs) based on the generative pre-training transformer (GPT)
have demonstrated remarkable effectiveness across a diverse range of downstream tasks …

Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following

Z Guo, R Zhang, X Zhu, Y Tang, X Ma, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …

CLIP-FO3D: Learning free open-world 3D scene representations from 2D dense CLIP

J Zhang, R Dong, K Ma - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Training a 3D scene understanding model requires complicated human annotations, which
are laborious to collect and result in a model only encoding closed-set object semantics. In …

Three pillars improving vision foundation model distillation for lidar

G Puy, S Gidaris, A Boulch, O Siméoni… - Proceedings of the …, 2024 - openaccess.thecvf.com
Self-supervised image backbones can be used to address complex 2D tasks (e.g., semantic
segmentation, object discovery) very efficiently and with little or no downstream supervision …

Swin3D: A pretrained transformer backbone for 3D indoor scene understanding

YQ Yang, YX Guo, JY Xiong, Y Liu, H Pan… - arXiv preprint arXiv …, 2023 - arxiv.org
The use of pretrained backbones with fine-tuning has been successful for 2D vision and
natural language processing tasks, showing advantages over task-specific networks. In this …