Contrastive language-image pre-training with knowledge graphs

X Pan, T Ye, D Han, S Song… - Advances in Neural …, 2022 - proceedings.neurips.cc
Recent years have witnessed the fast development of large-scale pre-training frameworks
that can extract multi-modal representations in a unified form and achieve promising …

Exposing and mitigating spurious correlations for cross-modal retrieval

JM Kim, A Koepke, C Schmid… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Cross-modal retrieval methods are the preferred tool to search databases for the text that
best matches a query image and vice versa. However, image-text retrieval models commonly …

Cross-modal retrieval: a systematic review of methods and future directions

L Zhu, T Wang, F Li, J Li, Z Zhang, HT Shen - arXiv preprint arXiv …, 2023 - arxiv.org
With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval
methods struggle to meet the needs of users demanding access to data from various …

CLIP-ReID: exploiting vision-language model for image re-identification without concrete text labels

S Li, L Sun, Q Li - Proceedings of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org
Pre-trained vision-language models like CLIP have recently shown superior performances
on various downstream tasks, including image classification and segmentation. However, in …

FAME-ViL: Multi-tasking vision-language model for heterogeneous fashion tasks

X Han, X Zhu, L Yu, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including
cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image …

CARZero: Cross-attention alignment for radiology zero-shot classification

H Lai, Q Yao, Z Jiang, R Wang, Z He… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advancement of Zero-Shot Learning in the medical domain has been driven
forward by using pre-trained models on large-scale image-text pairs focusing on image-text …

FashionSAP: Symbols and attributes prompt for fine-grained fashion vision-language pre-training

Y Han, L Zhang, Q Chen, Z Chen, Z Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Fashion vision-language pre-training models have shown efficacy for a wide range of
downstream tasks. However, general vision-language pre-training models pay less attention …

AesCLIP: Multi-attribute contrastive learning for image aesthetics assessment

X Sheng, L Li, P Chen, J Wu, W Dong, Y Yang… - Proceedings of the 31st …, 2023 - dl.acm.org
Image aesthetics assessment (IAA) aims at predicting the aesthetic quality of images.
Recently, large pre-trained vision-language models, like CLIP, have shown impressive …

LidarCLIP or: How I learned to talk to point clouds

G Hess, A Tonderski, C Petersson… - Proceedings of the …, 2024 - openaccess.thecvf.com
Research connecting text and images has recently seen several breakthroughs, with models
like CLIP, DALL·E 2, and Stable Diffusion. However, the connection between text and other …

Representation recovering for self-supervised pre-training on medical images

X Yan, J Naushad, S Sun, K Han… - Proceedings of the …, 2023 - openaccess.thecvf.com
Advances in self-supervised learning, especially in contrastive learning, have drawn
attention to investigating these techniques in providing effective visual representations from …