Cross-modal retrieval methods are the preferred tool to search databases for the text that best matches a query image, and vice versa. However, image-text retrieval models commonly …
With the exponential surge in diverse multi-modal data, traditional uni-modal retrieval methods struggle to meet the needs of users demanding access to data from various …
S Li, L Sun, Q Li - Proceedings of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org
Pre-trained vision-language models like CLIP have recently shown superior performances on various downstream tasks, including image classification and segmentation. However, in …
In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image …
The advancement of zero-shot learning in the medical domain has been driven forward by pre-trained models on large-scale image-text pairs focusing on image-text …
Fashion vision-language pre-training models have shown efficacy for a wide range of downstream tasks. However, general vision-language pre-training models pay less attention …
Image aesthetics assessment (IAA) aims at predicting the aesthetic quality of images. Recently, large pre-trained vision-language models, like CLIP, have shown impressive …
Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL·E 2, and Stable Diffusion. However, the connection between text and other …
Advances in self-supervised learning, especially in contrastive learning, have drawn attention to investigating these techniques in providing effective visual representations from …
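Several of the snippets above reference CLIP-style contrastive models, where images and texts are embedded into a shared space and retrieval reduces to cosine similarity between L2-normalized embeddings. Below is a minimal sketch of that retrieval step using NumPy; the embeddings here are synthetic stand-ins (real CLIP embeddings are 512- or 768-dimensional and come from the model's encoders), so only the ranking mechanics are illustrated.

```python
import numpy as np

def normalize(x):
    # L2-normalize rows so that dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical pre-computed embeddings: 4 images and their 4 paired
# captions in a shared 8-dim space (synthetic, for illustration only).
rng = np.random.default_rng(0)
image_emb = normalize(rng.normal(size=(4, 8)))
# Each caption embedding is a lightly perturbed copy of its image's
# embedding, mimicking the alignment contrastive pre-training produces.
text_emb = normalize(image_emb + 0.05 * rng.normal(size=(4, 8)))

# Text-to-image retrieval: rank images by cosine similarity per query.
sim = text_emb @ image_emb.T       # (num_texts, num_images) similarity matrix
best_image = sim.argmax(axis=1)    # index of the top-ranked image per caption
print(best_image)
```

Because each synthetic caption sits close to its paired image, the argmax recovers the diagonal pairing; in a real system, the same matrix-multiply-then-argmax pattern runs over encoder outputs, and image-to-text retrieval is the transpose of this computation.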