Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network (DNN) training, and they usually train a DNN for each single visual recognition task …

Alip: Adaptive language-image pre-training with synthetic caption

K Yang, J Deng, X An, J Li, Z Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the
performance of various vision-language tasks by scaling up the dataset with image-text pairs …

Learning visual representations via language-guided sampling

M El Banani, K Desai… - Proceedings of the ieee …, 2023 - openaccess.thecvf.com
Although an object may appear in numerous contexts, we often describe it in a limited
number of ways. Language allows us to abstract away visual variation to represent and …

Dreamlip: Language-image pre-training with long captions

K Zheng, Y Zhang, W Wu, F Lu, S Ma, X Jin… - … on Computer Vision, 2025 - Springer
Language-image pre-training largely relies on how precisely and thoroughly a text
describes its paired image. In practice, however, the contents of an image can be so rich that …

Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning

S Liang, M Zhu, A Liu, B Wu, X Cao… - Proceedings of the …, 2024 - openaccess.thecvf.com
While existing backdoor attacks have successfully infected multimodal contrastive learning
models such as CLIP, they can be easily countered by specialized backdoor defenses for …

Imitate: Clinical prior guided hierarchical vision-language pre-training

C Liu, S Cheng, M Shi, A Shah, W Bai… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In the field of medical Vision-Language Pretraining (VLP), significant efforts have been
devoted to deriving text and image features from both clinical reports and associated …

Learning customized visual models with retrieval-augmented knowledge

H Liu, K Son, J Yang, C Liu, J Gao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-text contrastive learning models such as CLIP have demonstrated strong task transfer
ability. The high generality and usability of these visual models is achieved via a web-scale …

Misalign, contrast then distill: Rethinking misalignments in language-image pre-training

B Kim, Y Jo, J Kim, S Kim - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Contrastive Language-Image Pretraining has emerged as a prominent approach for
training vision and text encoders with uncurated image-text pairs from the web. To enhance …

Heterogeneous contrastive learning for foundation models and beyond

L Zheng, B Jing, Z Li, H Tong, J He - Proceedings of the 30th ACM …, 2024 - dl.acm.org
In the era of big data and Artificial Intelligence, an emerging paradigm is to utilize contrastive
self-supervised learning to model large-scale heterogeneous data. Many existing foundation …

Non-contrastive learning meets language-image pre-training

J Zhou, L Dong, Z Gan, L Wang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align
images and texts. Nonetheless, the loose correlation between images and texts of web …