Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

ST-Adapter: Parameter-efficient image-to-video transfer learning

J Pan, Z Lin, X Zhu, J Shao, H Li - Advances in Neural …, 2022 - proceedings.neurips.cc
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged with promising performance. Due to the ever-growing model size, the …
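The abstract above is truncated. As a rough illustration of the parameter-efficient adapter idea the title refers to, below is a minimal PyTorch sketch of a spatio-temporal adapter attached to a frozen image backbone (down-projection, depthwise temporal convolution, up-projection). The module name, bottleneck width, and interface are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    """Illustrative spatio-temporal adapter: down-project, depthwise 3D conv
    over time, up-project. Only these few parameters would be trained while
    the pre-trained image backbone stays frozen."""
    def __init__(self, dim: int, bottleneck: int = 64, kernel_t: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.conv = nn.Conv3d(bottleneck, bottleneck,
                              kernel_size=(kernel_t, 1, 1),
                              padding=(kernel_t // 2, 0, 0),
                              groups=bottleneck)          # depthwise over time
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x, t, h, w):
        # x: (B, T*H*W, C) tokens from a frozen image transformer block
        b, n, c = x.shape
        z = self.down(x)                                    # (B, N, bottleneck)
        z = z.view(b, t, h, w, -1).permute(0, 4, 1, 2, 3)   # (B, C', T, H, W)
        z = self.conv(z)                                    # temporal mixing
        z = z.permute(0, 2, 3, 4, 1).reshape(b, n, -1)
        return x + self.up(self.act(z))                     # residual adapter

# e.g. tokens = STAdapter(dim=768)(tokens, t=8, h=14, w=14)
```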

UniFormer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
It is a challenging task to learn discriminative representations from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …
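As a sketch of the convolution-plus-self-attention hybrid the title describes, here is a toy block that uses cheap depthwise convolution for local token mixing in shallow stages and multi-head self-attention for global mixing in deep stages. This is a generic illustration under assumed dimensions, not the official UniFormer block.

```python
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    """Toy hybrid block: depthwise conv handles local redundancy cheaply,
    self-attention handles global dependency; a channel MLP follows both."""
    def __init__(self, dim: int, use_attention: bool, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        if use_attention:
            self.mixer = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.local = None
        else:
            self.mixer = None
            self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, N, C)
        h = self.norm1(x)
        if self.mixer is not None:
            h, _ = self.mixer(h, h, h)         # global relation modelling
        else:
            h = self.local(h.transpose(1, 2)).transpose(1, 2)  # local mixing
        x = x + h
        return x + self.mlp(self.norm2(x))

# shallow stage: LocalGlobalBlock(96, use_attention=False)
# deep stage:    LocalGlobalBlock(384, use_attention=True)
```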

Prompting visual-language models for efficient video understanding

C Ju, T Han, K Zheng, Y Zhang, W Xie - European Conference on …, 2022 - Springer
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …
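To illustrate the prompting idea the title refers to, below is a minimal sketch of text-side prompt tuning for a frozen image-based visual-language model: a handful of learnable context embeddings are prepended to class-name token embeddings, and only those context vectors are optimized. The class count, context length, and placeholder embeddings are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class TextPromptLearner(nn.Module):
    """Generic prompt tuning: learnable 'context' embeddings are prepended to
    class-name token embeddings; the visual-language model itself stays frozen."""
    def __init__(self, embed_dim: int = 512, n_ctx: int = 16, n_classes: int = 400):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # frozen class-name embeddings would come from the model's tokenizer and
        # token-embedding table; random buffers stand in for them here
        self.register_buffer("cls_tokens", torch.randn(n_classes, 8, embed_dim))

    def forward(self):
        n_classes = self.cls_tokens.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)  # (K, n_ctx, D)
        # (K, n_ctx + 8, D): prompt sequences fed to the frozen text encoder
        return torch.cat([ctx, self.cls_tokens], dim=1)
```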

EdgeViTs: Competing light-weight CNNs on mobile devices with vision transformers

J Pan, A Bulat, F Tan, X Zhu, L Dudziak, H Li… - … on Computer Vision, 2022 - Springer
Self-attention based models such as vision transformers (ViTs) have emerged as a highly
competitive architectural alternative to convolutional neural networks (CNNs) in computer …

Transformer meets remote sensing video detection and tracking: A comprehensive survey

L Jiao, X Zhang, X Liu, F Liu, S Yang… - IEEE Journal of …, 2023 - ieeexplore.ieee.org
Transformers have shown excellent performance in the remote sensing field thanks to their
long-range modeling capabilities. Remote sensing video (RSV) moving object detection and tracking …

TS2-Net: Token shift and selection transformer for text-video retrieval

Y Liu, P Xiong, L Xu, S Cao, Q Jin - European Conference on Computer …, 2022 - Springer
Text-video retrieval is a task of great practical value and has received increasing attention;
learning spatial-temporal video representations is one of its research hotspots …
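As a rough illustration of the temporal shift idea behind the "token shift" in the title, the sketch below moves a fraction of token channels one step forward and backward along the time axis so that each frame's tokens mix information from neighbouring frames. The shift ratio and channel-level formulation are assumptions; the paper's token-level variant may differ.

```python
import torch

def temporal_token_shift(x: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift a fraction of channels one step forward/backward along time.
    x: (B, T, N, C) video token features."""
    b, t, n, c = x.shape
    k = int(c * shift_ratio) // 2
    out = x.clone()
    out[:, 1:, :, :k] = x[:, :-1, :, :k]            # shift forward in time
    out[:, :-1, :, k:2 * k] = x[:, 1:, :, k:2 * k]  # shift backward in time
    return out

# tokens = temporal_token_shift(tokens)  # then feed to a per-frame transformer
```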

Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning

R Wang, D Chen, Z Wu, Y Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Benefiting from masked visual modeling, self-supervised video representation learning has
achieved remarkable progress. However, existing methods focus on learning …
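To make the masked visual modeling setup mentioned above concrete, here is a minimal sketch of random token masking for a video: most tokens are hidden, only the visible ones are encoded, and the model regresses features of the masked ones. The mask ratio and token layout are assumptions; the paper's distillation targets are not reproduced here.

```python
import torch

def random_token_mask(num_tokens: int, mask_ratio: float = 0.9,
                      device: str = "cpu") -> torch.Tensor:
    """Return a boolean mask over video tokens (True = masked)."""
    num_masked = int(num_tokens * mask_ratio)
    noise = torch.rand(num_tokens, device=device)
    ids = noise.argsort()                     # random permutation of token indices
    mask = torch.zeros(num_tokens, dtype=torch.bool, device=device)
    mask[ids[:num_masked]] = True
    return mask

# example: 8 frames x 14 x 14 patches = 1568 tokens, 90% masked
mask = random_token_mask(8 * 14 * 14, mask_ratio=0.9)
```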

PhysFormer: Facial video-based physiological measurement with temporal difference transformer

Z Yu, Y Shen, J Shi, H Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
Remote photoplethysmography (rPPG), which aims at measuring heart activities and
physiological signals from facial video without any contact, has great potential in many …
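As a sketch of the kind of temporal-difference operator the title's "temporal difference transformer" suggests, the module below mixes a standard temporal convolution with a convolution over frame-to-frame differences, which emphasizes subtle periodic changes such as pulse-induced color variation. The mixing weight and placement are assumptions, not the paper's exact operator.

```python
import torch
import torch.nn as nn

class TemporalDifferenceConv(nn.Module):
    """Illustrative temporal-difference convolution over video features."""
    def __init__(self, channels: int, theta: float = 0.7):
        super().__init__()
        self.theta = theta   # weight of the difference branch (assumed value)
        self.conv = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                              padding=(1, 0, 0))

    def forward(self, x):                             # x: (B, C, T, H, W)
        out = self.conv(x)
        diff = torch.zeros_like(x)
        diff[:, :, 1:] = x[:, :, 1:] - x[:, :, :-1]   # frame differences
        return out + self.theta * self.conv(diff)
```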

Adaptive token sampling for efficient vision transformers

M Fayyaz, SA Koohpayegani, FR Jafari… - … on Computer Vision, 2022 - Springer
While state-of-the-art vision transformer models achieve promising results in image
classification, they are computationally expensive and require many GFLOPs. Although the …
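To illustrate why token sampling reduces the GFLOPs mentioned above, here is a simplified stand-in that keeps the class token plus the highest-scoring patch tokens and discards the rest before later transformer layers. The paper's sampling is adaptive and differentiable; this top-k version and the scoring interface are assumptions.

```python
import torch

def prune_tokens(x: torch.Tensor, scores: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the class token plus the `keep` highest-scoring patch tokens.
    x: (B, 1 + N, C) with the class token first; scores: (B, N)."""
    cls_tok, patches = x[:, :1], x[:, 1:]
    idx = scores.topk(keep, dim=1).indices                # (B, keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])   # (B, keep, C)
    kept = patches.gather(1, idx)
    return torch.cat([cls_tok, kept], dim=1)              # (B, 1 + keep, C)

# scores could be, e.g., attention weights from the class token to each patch
```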