Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

Video action transformer network

R Girdhar, J Carreira, C Doersch… - Proceedings of the …, 2019 - openaccess.thecvf.com
We introduce the Action Transformer model for recognizing and localizing human
actions in video clips. We repurpose a Transformer-style architecture to aggregate features …

EPIC-KITCHENS VISOR benchmark: Video segmentations and object relations

A Darkhalil, D Shan, B Zhu, J Ma… - Advances in …, 2022 - proceedings.neurips.cc
We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for
segmenting hands and active objects in egocentric video. VISOR annotates videos from …

The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision

J Mao, C Gan, P Kohli, JB Tenenbaum, J Wu - arXiv preprint arXiv …, 2019 - arxiv.org
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual
concepts, words, and semantic parsing of sentences without explicit supervision on any of …

Action genome: Actions as compositions of spatio-temporal scene graphs

J Ji, R Krishna, L Fei-Fei… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Action recognition has typically treated actions and activities as monolithic events that occur
in videos. However, there is evidence from Cognitive Science and Neuroscience that people …

Long-term feature banks for detailed video understanding

CY Wu, C Feichtenhofer, H Fan, K He… - Proceedings of the …, 2019 - openaccess.thecvf.com
To understand the world, we humans constantly need to relate the present to the past, and
put events in context. In this paper, we enable existing video models to do the same. We …

Understanding human hands in contact at internet scale

D Shan, J Geng, M Shu… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Hands are the central means by which humans manipulate their world and being able to
reliably extract hand state information from Internet videos of humans engaged in their …

Large-scale weakly-supervised pre-training for video action recognition

D Ghadiyaram, D Tran… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Current fully-supervised video datasets consist of only a few hundred thousand videos and
fewer than a thousand domain-specific labels. This hinders the progress towards advanced …

What makes training multi-modal classification networks hard?

W Wang, D Tran, M Feiszli - … of the IEEE/CVF Conference on …, 2020 - openaccess.thecvf.com
Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with
multiple input modalities: the multi-modal network receives more information, so it should …

Towards long-form video understanding

CY Wu, P Krähenbühl - … of the IEEE/CVF Conference on …, 2021 - openaccess.thecvf.com
Our world offers a never-ending stream of visual stimuli, yet today's vision systems only
accurately recognize patterns within a few seconds. These systems understand the present …