represent its relevant image features (e.g., from dog photos). A recent study has also
demonstrated the cross-modal transferability of this joint space. Motivated by these
observations, we propose PromptStyler, which simulates various distribution shifts in the joint
space by synthesizing diverse styles via prompts, without using any images, to address
source-free domain generalization. The proposed method learns to generate a variety of …
J. Cho, G. Nam, S. Kim, H. Yang, S. Kwak - openaccess.thecvf.com
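To make the idea concrete, below is a minimal sketch of learning diverse "style" word vectors against a frozen text encoder. This is not the paper's exact implementation: the `FrozenTextEncoder` class, the variable names, and the simple orthogonality-style diversity loss are illustrative assumptions standing in for a CLIP-like text encoder that embeds a prompt such as "a S_k style of a [class]" onto the joint hypersphere.

```python
import torch
import torch.nn.functional as F

# Hypothetical frozen text encoder standing in for CLIP's: it maps a
# learnable "style word" vector (conceptually inserted into a prompt
# template) to a feature on the joint vision-language hypersphere.
class FrozenTextEncoder(torch.nn.Module):
    def __init__(self, word_dim=512, feat_dim=512):
        super().__init__()
        self.proj = torch.nn.Linear(word_dim, feat_dim)
        for p in self.parameters():
            p.requires_grad_(False)  # pre-trained weights stay fixed

    def forward(self, style_words):
        return F.normalize(self.proj(style_words), dim=-1)

K, word_dim = 8, 512                 # number of style word vectors to learn
encoder = FrozenTextEncoder(word_dim)
styles = torch.nn.Parameter(torch.randn(K, word_dim) * 0.02)
opt = torch.optim.SGD([styles], lr=0.002, momentum=0.9)

for step in range(100):
    feats = encoder(styles)          # (K, feat_dim), unit-norm features
    sim = feats @ feats.t()          # pairwise cosine similarities
    off_diag = sim - torch.eye(K)    # zero out self-similarity
    # Diversity objective: push style features toward mutual orthogonality
    # so the prompts simulate distinct distribution shifts in the joint space.
    loss = off_diag.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the paper's full formulation the style vectors are additionally constrained so that style-content prompt features stay consistent with their content; the sketch above isolates only the diversity part for brevity.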
We choose CLIP [13] as our pre-trained vision-language model, a large-scale model
trained on 400 million image-text pairs. Note that the proposed method is broadly
applicable to CLIP-like vision-language models [7, 16], which also construct
hyperspherical joint vision-language spaces via contrastive learning. Given a
batch of image-text pairs, such models jointly train an image encoder and a text encoder
so that similarity scores of matched image-text pairs are maximized while those of
mismatched pairs are minimized. Joint vision-language …
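The batch-level contrastive objective described above is commonly written as a symmetric cross-entropy over the image-text similarity matrix, as in CLIP. The sketch below assumes pre-computed encoder outputs; the function name, dimensions, and temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text features.

    Matched pairs share an index: the i-th image should score highest
    against the i-th text, and vice versa.
    """
    image_feats = F.normalize(image_feats, dim=-1)  # project onto hypersphere
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0))               # diagonal = matches
    loss_i2t = F.cross_entropy(logits, targets)          # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)      # text -> image
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features standing in for encoder outputs.
B, D = 4, 512
loss = clip_contrastive_loss(torch.randn(B, D), torch.randn(B, D))
```

Because both feature sets are L2-normalized, the logits are cosine similarities scaled by the temperature, which is what places image and text features on a shared hypersphere.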