Fine-grained action retrieval through multiple parts-of-speech embeddings

M Wray, D Larlus, G Csurka… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
We address the problem of cross-modal fine-grained action retrieval between text and video.
Cross-modal retrieval is commonly achieved through learning a shared embedding space …
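
The snippet above names the common mechanism, learning a shared embedding space for video and text. Below is a minimal sketch of that general idea only (not the paper's multi-part-of-speech method), assuming pre-extracted video and text features; the dimensions, projection heads, and symmetric contrastive loss are illustrative choices.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedEmbedding(nn.Module):
        """Projects video and text features into a common, L2-normalised space."""
        def __init__(self, video_dim=2048, text_dim=300, embed_dim=256):
            super().__init__()
            self.video_proj = nn.Sequential(nn.Linear(video_dim, embed_dim),
                                            nn.ReLU(),
                                            nn.Linear(embed_dim, embed_dim))
            self.text_proj = nn.Sequential(nn.Linear(text_dim, embed_dim),
                                           nn.ReLU(),
                                           nn.Linear(embed_dim, embed_dim))

        def forward(self, video_feats, text_feats):
            v = F.normalize(self.video_proj(video_feats), dim=-1)
            t = F.normalize(self.text_proj(text_feats), dim=-1)
            return v, t

    def contrastive_loss(v, t, temperature=0.07):
        # Symmetric InfoNCE: matching video/caption pairs lie on the diagonal.
        logits = v @ t.T / temperature
        targets = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

    # Toy usage with random features standing in for pre-extracted descriptors.
    model = SharedEmbedding()
    v, t = model(torch.randn(8, 2048), torch.randn(8, 300))
    contrastive_loss(v, t).backward()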

Rethinking zero-shot video classification: End-to-end training for realistic applications

B Brattoli, J Tighe, F Zhdanov… - Proceedings of the …, 2020 - openaccess.thecvf.com
Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds
of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) …

Audio-visual generalised zero-shot learning with cross-modal attention and language

OB Mercea, L Riesch, A Koepke… - Proceedings of the …, 2022 - openaccess.thecvf.com
Learning to classify video data from classes not included in the training data, i.e., video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …

Cross-modal representation learning for zero-shot action recognition

CC Lin, K Lin, L Wang, Z Liu… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
We present a cross-modal Transformer-based framework, which jointly encodes video data
and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually …
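
As a rough illustration of jointly encoding video features and label word embeddings with a Transformer (a generic sketch under assumed feature shapes, not the authors' architecture), one can concatenate the two token sequences, read a compatibility score from a learned [CLS] token, and at test time score each unseen label and keep the best match.

    import torch
    import torch.nn as nn

    class JointVideoLabelScorer(nn.Module):
        """Scores how well a sequence of video features matches a label's word embeddings."""
        def __init__(self, feat_dim=512, n_heads=8, n_layers=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.cls = nn.Parameter(torch.randn(1, 1, feat_dim) * 0.02)
            self.score = nn.Linear(feat_dim, 1)

        def forward(self, video_tokens, label_tokens):
            # video_tokens: (B, T, D) clip features; label_tokens: (B, L, D) label word embeddings.
            B = video_tokens.size(0)
            x = torch.cat([self.cls.expand(B, -1, -1), video_tokens, label_tokens], dim=1)
            x = self.encoder(x)
            return self.score(x[:, 0]).squeeze(-1)  # one compatibility score per (video, label) pair

    # Zero-shot inference: score one video against every unseen label, pick the best match.
    model = JointVideoLabelScorer()
    video = torch.randn(1, 16, 512)
    labels = torch.randn(5, 3, 512)                  # 5 candidate labels, 3 word tokens each
    scores = model(video.expand(5, -1, -1), labels)
    pred = scores.argmax().item()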

Multi-modal zero-shot dynamic hand gesture recognition

R Rastgoo, K Kiani, S Escalera, M Sabokrou - Expert Systems with …, 2024 - Elsevier
Zero-Shot Learning (ZSL) has rapidly advanced in recent years. Towards overcoming the
annotation bottleneck in the Dynamic Hand Gesture Recognition (DHGR) …

Towards zero-shot sign language recognition

YC Bilge, RG Cinbis… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
This paper tackles the problem of zero-shot sign language recognition (ZSSLR), where the
goal is to leverage models learned over the seen sign classes to recognize the instances of …

Multimodal open-vocabulary video classification via pre-trained vision and language models

R Qian, Y Li, Z Xu, MH Yang, S Belongie… - arXiv preprint arXiv …, 2022 - arxiv.org
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is
becoming a promising paradigm for open-vocabulary visual recognition. In this work, we …
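
A generic sketch of the open-vocabulary paradigm the snippet describes, using the OpenAI clip package to embed video frames and free-form class prompts in a shared space; this is an illustration only, not the paper's multimodal pipeline, and the class names, prompt template, and frame tensor are placeholders.

    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    class_names = ["playing guitar", "archery", "salsa dancing"]  # arbitrary, unseen at training
    text = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)

    frames = torch.randn(8, 3, 224, 224).to(device)  # stand-in for 8 preprocessed video frames

    with torch.no_grad():
        frame_emb = model.encode_image(frames)                   # per-frame embeddings
        video_emb = frame_emb.mean(dim=0, keepdim=True)          # simple temporal pooling
        text_emb = model.encode_text(text)
        video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        probs = (100.0 * video_emb @ text_emb.T).softmax(dim=-1)

    print(class_names[probs.argmax().item()])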

Quo vadis, skeleton action recognition?

P Gupta, A Thatipelli, A Aggarwal… - International Journal of …, 2021 - Springer
In this paper, we study current and upcoming frontiers across the landscape of skeleton-based
human action recognition. To study skeleton-action recognition in the wild, we …

Zero-shot action recognition with transformer-based video semantic embedding

K Doshi, Y Yilmaz - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
While video action recognition has been an active area of research for several years, zero-shot
action recognition has only recently started gaining traction. In this work, we propose a …

Alignment-uniformity aware representation learning for zero-shot video classification

S Pu, K Zhao, M Zheng - … of the IEEE/CVF Conference on …, 2022 - openaccess.thecvf.com
Most methods tackle zero-shot video classification by aligning visual-semantic
representations within seen classes, which limits generalization to unseen classes. To …
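
The alignment and uniformity terms referenced in the title are commonly written as l_align = E ||f(x) - f(y)||^alpha over matched pairs and l_uniform = log E exp(-t ||f(x) - f(y)||^2) over all pairs; the sketch below gives these generic definitions in PyTorch (not the paper's exact formulation), assuming L2-normalised visual and semantic embeddings.

    import torch
    import torch.nn.functional as F

    def alignment_loss(v, s, alpha=2):
        # v, s: L2-normalised embeddings of matched (visual, semantic) pairs, shape (N, D).
        return (v - s).norm(dim=1).pow(alpha).mean()

    def uniformity_loss(x, t=2):
        # Pairwise squared distances; pushes embeddings to spread over the hypersphere.
        return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

    # Toy usage with normalised random embeddings for 32 seen-class samples.
    v = F.normalize(torch.randn(32, 128), dim=-1)
    s = F.normalize(torch.randn(32, 128), dim=-1)
    total = alignment_loss(v, s) + uniformity_loss(v) + uniformity_loss(s)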