C Schuhmann, R Vencu, R Beaumont… - arXiv preprint arXiv …, 2021 - arxiv.org
Multi-modal language-vision models trained on hundreds of millions of image-text pairs (e.g.
CLIP, DALL-E) gained a recent surge, showing remarkable capability to perform zero- or few …