Human visual perception carves a scene at its physical joints, decomposing the world into objects, which are selectively attended, tracked and predicted as we engage our …
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we …
The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven …
Abstract 'Intuitive physics' enables our pragmatic engagement with the physical world and forms a key component of 'common sense'aspects of thought. Current artificial intelligence …
R Girdhar, K Grauman - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
Abstract We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to …
Object-centric representations are a promising path toward more systematic generalization by providing flexible abstractions upon which compositional world models can be built …
G Singh, YF Wu, S Ahn - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Unsupervised object-centric learning aims to represent the modular, compositional, and causal structure of a scene as a set of object representations and thereby promises to …
Recently, tracking-by-detection has become a popular paradigm in Multiple-object tracking (MOT) for its concise pipeline. Many current works first associate the detections to form track …
Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and …