In the eye of beholder: Joint learning of gaze and actions in first person video

F Gu, MH Chung, M Chignell, S Valaee… - ACM Computing …, 2021 - dl.acm.org

Human activity recognition is a key to a lot of applications such as healthcare and smart
home. In this study, we provide a comprehensive survey on recent advances and challenges …

Cited by 157 · Related articles · All 3 versions

[PDF] edgehill.ac.uk

A review of multimodal human activity recognition with special emphasis on classification, applications, challenges and future directions

SK Yadav, K Tiwari, HM Pandey, SA Akbar - Knowledge-Based Systems, 2021 - Elsevier

Human activity recognition (HAR) is one of the most important and challenging problems in
the computer vision. It has critical application in wide variety of tasks including gaming …

Cited by 189 · Related articles · All 3 versions

[PDF] arxiv.org

Socratic models: Composing zero-shot multimodal reasoning with language

A Zeng, M Attarian, B Ichter, K Choromanski… - arXiv preprint arXiv …, 2022 - arxiv.org

Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the
domain of data they are trained on. While these domains are generic, they may only barely …

Cited by 383 · Related articles · All 6 versions

[PDF] thecvf.com

Ego4d: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Cited by 657 · Related articles · All 13 versions

[PDF] thecvf.com

Learning video representations from large language models

Y Zhao, I Misra, P Krähenbühl… - Proceedings of the …, 2023 - openaccess.thecvf.com

We introduce LAVILA, a new approach to learning video-language representations by
leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be …

Cited by 103 · Related articles · All 7 versions

[PDF] thecvf.com

Affordances from human videos as a versatile representation for robotics

S Bahl, R Mendonca, L Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com

Building a robot that can understand and learn to interact by watching humans has inspired
several vision problems. However, despite some successful results on static datasets, it …

Cited by 62 · Related articles · All 9 versions

[PDF] thecvf.com

Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com

We present Ego-Exo4D, a diverse large-scale multimodal multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …

Cited by 34 · Related articles · All 5 versions

[PDF] thecvf.com

Anticipative video transformer

R Girdhar, K Grauman - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based
video modeling architecture that attends to the previously observed video in order to …

Cited by 200 · Related articles · All 6 versions

[PDF] academia.edu

Review of eye tracking metrics involved in emotional and cognitive processes

V Skaramagkas, G Giannakakis… - IEEE Reviews in …, 2021 - ieeexplore.ieee.org

Eye behaviour provides valuable information revealing one's higher cognitive functions and
state of affect. Although eye tracking is gaining ground in the research community, it is not …

Cited by 143 · Related articles · All 5 versions

[PDF] aaai.org

Smart frame selection for action recognition

SN Gowda, M Rohrbach, L Sevilla-Lara - Proceedings of the AAAI …, 2021 - ojs.aaai.org

Video classification is computationally expensive. In this paper, we address the problem of
frame selection to reduce the computational cost of video classification. Recent work has …

Cited by 152 · Related articles · All 6 versions
