Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

MaskCLIP: Masked self-distillation advances contrastive language-image pretraining

X Dong, J Bao, Y Zheng, T Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper presents a simple yet effective framework MaskCLIP, which incorporates a newly
proposed masked self-distillation into contrastive language-image pretraining. The core idea …

Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

Med-UniC: Unifying cross-lingual medical vision-language pre-training by diminishing bias

Z Wan, C Liu, M Zhang, J Fu, B Wang… - Advances in …, 2024 - proceedings.neurips.cc
The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-
training (VLP). A potential solution lies in the combination of datasets from various language …

SuS-X: Training-free name-only transfer of vision-language models

V Udandarao, A Gupta… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet
effective way to train large-scale vision-language models. CLIP demonstrates impressive …

DILF: Differentiable rendering-based multi-view Image–Language Fusion for zero-shot 3D shape understanding

X Ning, Z Yu, L Li, W Li, P Tiwari - Information Fusion, 2024 - Elsevier
Zero-shot 3D shape understanding aims to recognize “unseen” 3D categories that are not
present in training data. Recently, Contrastive Language–Image Pre-training (CLIP) has …

Teaching structured vision & language concepts to vision & language models

S Doveh, A Arbelle, S Harary… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in
a variety of tasks. However, some aspects of complex language understanding still remain a …

CLIP goes 3D: Leveraging prompt tuning for language grounded 3D recognition

D Hegde, JMJ Valanarasu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Vision-Language models like CLIP have been widely adopted for various tasks due to their
impressive zero-shot capabilities. However, CLIP is not suitable for extracting 3D geometric …

Going beyond nouns with vision & language models using synthetic data

P Cascante-Bonilla, K Shehada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale pre-trained Vision & Language (VL) models have shown remarkable
performance in many applications, enabling replacing a fixed set of supported classes with …

CLIPood: Generalizing CLIP to out-of-distributions

Y Shu, X Guo, J Wu, X Wang… - … on Machine Learning, 2023 - proceedings.mlr.press
Out-of-distribution (OOD) generalization, where the model needs to handle
distribution shifts from training, is a major challenge of machine learning. Contrastive …