representation structure and image space. Each layer comprises a set of tokens arranged
"on-the-grid," which biases patches or tokens to encode information at a specific
spatio(-temporal) location. In this work we present Moving Off-the-Grid (MooG), a self-supervised
video representation model that offers an alternative approach, allowing tokens to move
"off-the-grid" to better enable them to represent scene elements consistently, even as they move …