Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

VideoMamba: State space model for efficient video understanding

K Li, X Li, Y Wang, Y He, Y Wang, L Wang… - European Conference on …, 2024 - Springer
Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work adapts Mamba to the video domain. The proposed …

ST-Adapter: Parameter-efficient image-to-video transfer learning

J Pan, Z Lin, X Zhu, J Shao, H Li - Advances in Neural …, 2022 - proceedings.neurips.cc
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged with promising performance. Due to the ever-growing model size, the …

UniFormer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
It is a challenging task to learn discriminative representations from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …

TDN: Temporal difference networks for efficient action recognition

L Wang, Z Tong, B Ji, G Wu - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
Temporal modeling remains challenging for action recognition in videos. To mitigate this
issue, this paper presents a new video architecture, termed Temporal Difference Network …

Mamba-360: Survey of state space models as a transformer alternative for long sequence modelling: Methods, applications, and challenges

BN Patro, VS Agneeswaran - arXiv preprint arXiv:2404.16112, 2024 - arxiv.org
Sequence modeling is a crucial area across various domains, including Natural Language
Processing (NLP), speech recognition, time series forecasting, music generation, and …

UniFormerV2: Unlocking the potential of image ViTs for video understanding

K Li, Y Wang, Y He, Y Li, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
The prolific performances of Vision Transformers (ViTs) in image tasks have prompted
research into adapting the image ViTs for video tasks. However, the substantial gap …

UniFormerV2: Spatiotemporal learning by arming image ViTs with video UniFormer

K Li, Y Wang, Y He, Y Li, Y Wang, L Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
Learning discriminative spatiotemporal representation is the key problem of video
understanding. Recently, Vision Transformers (ViTs) have shown their power in learning …

A cooperative vehicle-infrastructure system for road hazards detection with edge intelligence

C Chen, G Yao, L Liu, Q Pei, H Song… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Road hazards (RH) have long been a cause of many serious traffic accidents. They
have posed a threat to the safety of drivers, passengers, and pedestrians, and have also …

Diversifying spatial-temporal perception for video domain generalization

KY Lin, JR Du, Y Gao, J Zhou… - Advances in Neural …, 2024 - proceedings.neurips.cc
Video domain generalization aims to learn generalizable video classification models for
unseen target domains by training in a source domain. A critical challenge of video domain …