Visual tuning

BXB Yu, J Chang, H Wang, L Liu, S Wang… - ACM Computing …, 2023 - dl.acm.org
Fine-tuning visual models has been widely shown to achieve promising performance on many
downstream visual tasks. With the surprising development of pre-trained visual foundation …

Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders

R Zhang, L Wang, Y Qiao, P Gao… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Pre-training on numerous image data has become the de facto approach for learning robust 2D representations. In
contrast, due to expensive data processing, the paucity of 3D datasets severely hinders …

ULIP-2: Towards scalable multimodal pre-training for 3D understanding

L Xue, N Yu, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advancements in multimodal pre-training have shown promising efficacy in 3D
representation learning by aligning multimodal features across 3D shapes, their 2D …

Contrast with reconstruct: Contrastive 3D representation learning guided by generative pretraining

Z Qi, R Dong, G Fan, Z Ge, X Zhang… - … on Machine Learning, 2023 - proceedings.mlr.press
Mainstream 3D representation learning approaches are built upon contrastive or generative
modeling pretext tasks, which have yielded great improvements in performance on various downstream …

DreamLLM: Synergistic multimodal comprehension and creation

R Dong, C Han, Y Peng, Z Qi, Z Ge, J Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents DreamLLM, a learning framework that first achieves versatile
Multimodal Large Language Models (MLLMs) empowered with frequently overlooked …

PointGPT: Auto-regressively generative pre-training from point clouds

G Chen, M Wang, Y Yang, K Yu… - Advances in Neural …, 2024 - proceedings.neurips.cc
Large language models (LLMs) based on the generative pre-training transformer (GPT)
have demonstrated remarkable effectiveness across a diverse range of downstream tasks …

Point-Bind & Point-LLM: Aligning point cloud with multi-modality for 3D understanding, generation, and instruction following

Z Guo, R Zhang, X Zhu, Y Tang, X Ma, J Han… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with 2D image,
language, audio, and video. Guided by ImageBind, we construct a joint embedding space …

CLIP-FO3D: Learning free open-world 3D scene representations from 2D dense CLIP

J Zhang, R Dong, K Ma - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Training a 3D scene understanding model requires complicated human annotations, which
are laborious to collect and result in a model only encoding closed-set object semantics. In …

Three pillars improving vision foundation model distillation for lidar

G Puy, S Gidaris, A Boulch, O Siméoni… - Proceedings of the …, 2024 - openaccess.thecvf.com
Self-supervised image backbones can be used to address complex 2D tasks (e.g., semantic
segmentation, object discovery) very efficiently and with little or no downstream supervision …

Swin3D: A pretrained transformer backbone for 3D indoor scene understanding

YQ Yang, YX Guo, JY Xiong, Y Liu, H Pan… - arXiv preprint arXiv …, 2023 - arxiv.org
The use of pretrained backbones with fine-tuning has been successful for 2D vision and
natural language processing tasks, showing advantages over task-specific networks. In this …