Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task …

Alip: Adaptive language-image pre-training with synthetic caption

K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with image-text pairs …

Learning visual representations via language-guided sampling

M El Banani, K Desai… - Proceedings of the ieee …, 2023 - openaccess.thecvf.com
Although an object may appear in numerous contexts, we often describe it in a limited
number of ways. Language allows us to abstract away visual variation to represent and …

Dreamlip: Language-image pre-training with long captions

K Zheng, Y Zhang, W Wu, F Lu, S Ma, X Jin… - … on Computer Vision, 2025 - Springer
Language-image pre-training largely relies on how precisely and thoroughly a text
describes its paired image. In practice, however, the contents of an image can be so rich that …

Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning

S Liang, M Zhu, A Liu, B Wu, X Cao… - Proceedings of the …, 2024 - openaccess.thecvf.com
While existing backdoor attacks have successfully infected multimodal contrastive learning
models such as CLIP, they can be easily countered by specialized backdoor defenses for …

Imitate: Clinical prior guided hierarchical vision-language pre-training

C Liu, S Cheng, M Shi, A Shah, W Bai… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In the field of medical Vision-Language Pretraining (VLP), significant efforts have been
devoted to deriving text and image features from both clinical reports and associated …

Learning customized visual models with retrieval-augmented knowledge

H Liu, K Son, J Yang, C Liu, J Gao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text contrastive learning models such as CLIP have demonstrated strong task transfer
ability. The high generality and usability of these visual models is achieved via a web-scale …

Misalign, contrast then distill: Rethinking misalignments in language-image pre-training

B Kim, Y Jo, J Kim, S Kim - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pretraining has emerged as a prominent approach for
training vision and text encoders with uncurated image-text pairs from the web. To enhance …

Heterogeneous contrastive learning for foundation models and beyond

L Zheng, B Jing, Z Li, H Tong, J He - Proceedings of the 30th ACM …, 2024 - dl.acm.org
In the era of big data and Artificial Intelligence, an emerging paradigm is to utilize contrastive
self-supervised learning to model large-scale heterogeneous data. Many existing foundation …

Non-contrastive learning meets language-image pre-training

J Zhou, L Dong, Z Gan, L Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align
images and texts. Nonetheless, the loose correlation between images and texts of web …