H Bansal, Y Bitton, I Szpektor, KW Chang… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite being (pre)trained on a massive amount of data, state-of-the-art video-language
alignment models are not robust to semantically-plausible contrastive changes in the video …