Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

Video action transformer network

R Girdhar, J Carreira, C Doersch… - Proceedings of the …, 2019 - openaccess.thecvf.com
We introduce the Action Transformer model for recognizing and localizing human
actions in video clips. We repurpose a Transformer-style architecture to aggregate features …

EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations

A Darkhalil, D Shan, B Zhu, J Ma… - Advances in …, 2022 - proceedings.neurips.cc
We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for
segmenting hands and active objects in egocentric video. VISOR annotates videos from …

The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision

J Mao, C Gan, P Kohli, JB Tenenbaum, J Wu - arXiv preprint arXiv …, 2019 - arxiv.org
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual
concepts, words, and semantic parsing of sentences without explicit supervision on any of …

Action genome: Actions as compositions of spatio-temporal scene graphs

J Ji, R Krishna, L Fei-Fei… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Action recognition has typically treated actions and activities as monolithic events that occur
in videos. However, there is evidence from Cognitive Science and Neuroscience that people …

Long-term feature banks for detailed video understanding

CY Wu, C Feichtenhofer, H Fan, K He… - Proceedings of the …, 2019 - openaccess.thecvf.com
To understand the world, we humans constantly need to relate the present to the past, and
put events in context. In this paper, we enable existing video models to do the same. We …

Understanding human hands in contact at internet scale

D Shan, J Geng, M Shu… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Hands are the central means by which humans manipulate their world and being able to
reliably extract hand state information from Internet videos of humans engaged in their …

Large-scale weakly-supervised pre-training for video action recognition

D Ghadiyaram, D Tran… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Current fully-supervised video datasets consist of only a few hundred thousand videos and
fewer than a thousand domain-specific labels. This hinders the progress towards advanced …

What makes training multi-modal classification networks hard?

W Wang, D Tran, M Feiszli - … of the IEEE/CVF Conference on …, 2020 - openaccess.thecvf.com
Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with
multiple input modalities: the multi-modal network receives more information, so it should …

Towards long-form video understanding

CY Wu, P Krähenbühl - … of the IEEE/CVF Conference on …, 2021 - openaccess.thecvf.com
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only
accurately recognize patterns within a few seconds. These systems understand the present …