S Buch, C Eyzaguirre, A Gaidon, J Wu, L Fei-Fei… - arXiv preprint arXiv …, 2022 - arxiv.org
What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …