Action2Vec: A crossmodal embedding approach to action learning

M Hahn, A Silva, JM Rehg - arXiv preprint arXiv:1901.00484, 2019 - arxiv.org
We describe a novel cross-modal embedding space for actions, named Action2Vec, which
combines linguistic cues from class labels with spatio-temporal features derived from video …

Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

KY Lin, H Ding, J Zhou, YX Peng, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pretraining (CLIP) has shown remarkable open-vocabulary
abilities across various image understanding tasks. Building upon this impressive success …

Home Action Genome: Cooperative compositional action understanding

N Rai, H Chen, J Ji, R Desai… - Proceedings of the …, 2021 - openaccess.thecvf.com
Existing research on action recognition treats activities as monolithic events occurring in
videos. Recently, the benefits of formulating actions as a combination of atomic actions have …

Open set action recognition via multi-label evidential learning

C Zhao, D Du, A Hoogs, C Funk - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Existing methods for open set action recognition focus on novelty detection that assumes
video clips show a single action, which is unrealistic in the real world. We propose a new …

PromptonomyViT: Multi-task prompt learning improves video transformers using synthetic scene data

R Herzig, O Abramovich… - Proceedings of the …, 2024 - openaccess.thecvf.com
Action recognition models have achieved impressive results by incorporating scene-level
annotations, such as objects, their relations, 3D structure, and more. However, obtaining …

Learn2Augment: Learning to composite videos for data augmentation in action recognition

SN Gowda, M Rohrbach, F Keller… - European conference on …, 2022 - Springer
We address the problem of data augmentation for video action recognition. Standard
augmentation strategies in video are hand-designed and sample the space of possible …

More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation

Q Fan, CFR Chen, H Kuehne… - Advances in Neural …, 2019 - proceedings.neurips.cc
Current state-of-the-art models for video action recognition are mostly based on expensive
3D ConvNets. This results in a need for large GPU clusters to train and evaluate such …

Intra- and inter-action understanding via temporal action parsing

D Shao, Y Zhao, B Dai, D Lin - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Current methods for action recognition primarily rely on deep convolutional networks to
derive feature embeddings of visual and motion features. While these methods have …

Large-scale weakly-supervised pre-training for video action recognition

D Ghadiyaram, D Tran… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Current fully-supervised video datasets consist of only a few hundred thousand videos and
fewer than a thousand domain-specific labels. This hinders the progress towards advanced …

ActionCLIP: A new paradigm for video action recognition

M Wang, J Xing, Y Liu - arXiv preprint arXiv:2109.08472, 2021 - arxiv.org
The canonical approach to video action recognition dictates that a neural model perform a classic and standard 1-of-N majority vote task. Models are trained to predict a fixed set of predefined …