CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning

E Davis - ACM Computing Surveys, 2023 - dl.acm.org

More than one hundred benchmarks have been developed to test the commonsense
knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems …

Cited by 54 · Related articles · All 4 versions

[PDF] arxiv.org

Capturing the objects of vision with neural networks

B Peters, N Kriegeskorte - Nature human behaviour, 2021 - nature.com

Human visual perception carves a scene at its physical joints, decomposing the world into
objects, which are selectively attended, tracked and predicted as we engage our …

Cited by 66 · Related articles · All 12 versions

[PDF] thecvf.com

Revisiting the "video" in video-language understanding

S Buch, C Eyzaguirre, A Gaidon, J Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com

What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …

Cited by 170 · Related articles · All 6 versions

[PDF] neurips.cc

Savi++: Towards end-to-end object-centric learning from real-world videos

G Elsayed, A Mahendran… - Advances in …, 2022 - proceedings.neurips.cc

The visual world can be parsimoniously characterized in terms of distinct entities with sparse
interactions. Discovering this compositional structure in dynamic visual scenes has proven …

Cited by 140 · Related articles · All 7 versions

[PDF] nature.com

Intuitive physics learning in a deep-learning model inspired by developmental psychology

LS Piloto, A Weinstein, P Battaglia… - Nature human …, 2022 - nature.com

'Intuitive physics' enables our pragmatic engagement with the physical world and
forms a key component of 'common sense' aspects of thought. Current artificial intelligence …

Cited by 115 · Related articles · All 9 versions

[PDF] thecvf.com

Anticipative video transformer

R Girdhar, K Grauman - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based
video modeling architecture that attends to the previously observed video in order to …

Cited by 244 · Related articles · All 6 versions

[PDF] arxiv.org

Conditional object-centric learning from video

T Kipf, GF Elsayed, A Mahendran, A Stone… - arXiv preprint arXiv …, 2021 - arxiv.org

Object-centric representations are a promising path toward more systematic generalization
by providing flexible abstractions upon which compositional world models can be built …

Cited by 215 · Related articles · All 3 versions

[PDF] neurips.cc

Simple unsupervised object-centric learning for complex and naturalistic videos

G Singh, YF Wu, S Ahn - Advances in Neural Information …, 2022 - proceedings.neurips.cc

Unsupervised object-centric learning aims to represent the modular, compositional, and
causal structure of a scene as a set of object representations and thereby promises to …

Cited by 113 · Related articles · All 7 versions

[PDF] ieee.org

Extendable multiple nodes recurrent tracking framework with RTU++

S Wang, H Sheng, D Yang, Y Zhang… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org

Recently, tracking-by-detection has become a popular paradigm in multiple-object tracking
(MOT) for its concise pipeline. Many current works first associate the detections to form track …

Cited by 94 · Related articles · All 4 versions

[PDF] arxiv.org

Star: A benchmark for situated reasoning in real-world videos

B Wu, S Yu, Z Chen, JB Tenenbaum, C Gan - arXiv preprint arXiv …, 2024 - arxiv.org

Reasoning in the real world is not divorced from situations. How to capture the present
knowledge from surrounding situations and perform reasoning accordingly is crucial and …

Cited by 160 · Related articles · All 8 versions
