Human activity recognition (HAR) is one of the most important and challenging problems in computer vision. It has critical applications in a wide variety of tasks including gaming …
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely …
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …
We introduce LAVILA, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …
Building a robot that can understand and learn to interact by watching humans has inspired several vision problems. However, despite some successful results on static datasets, it …
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …
R Girdhar, K Grauman - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to …
Eye behaviour provides valuable information revealing one's higher cognitive functions and state of affect. Although eye tracking is gaining ground in the research community, it is not …
Video classification is computationally expensive. In this paper, we address the problem of frame selection to reduce the computational cost of video classification. Recent work has …