F Liu, T Zhang, W Dai, W Cai, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised
pre-training models (e.g., ImageNet-based pre-training) as the new generation of visual …