Y Oh, P Ahn, J Kim, G Song, S Lee, IS Kweon… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot
recognition abilities yet face challenges in visio-linguistic compositionality, particularly in …