Rethinking video ViTs: Sparse video tubes for joint image and video learning

AJ Piergiovanni, W Kuo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We present a simple approach which can turn a ViT encoder into an efficient video model,
which can seamlessly work with both image and video inputs. By sparsely sampling the …

Prune spatio-temporal tokens by semantic-aware temporal accumulation

S Ding, P Zhao, X Zhang, R Qian… - Proceedings of the …, 2023 - openaccess.thecvf.com
Transformers have become the primary backbone of the computer vision community due to
their impressive performance. However, the unfriendly computation cost impedes their …

Efficient video action detection with token dropout and context refinement

L Chen, Z Tong, Y Song, G Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Streaming video clips with large-scale video tokens impede vision transformers (ViTs) for
efficient recognition, especially in video action detection where sufficient spatiotemporal …

PMI sampler: Patch similarity guided frame selection for aerial action recognition

R Xian, X Wang, D Kothandaraman… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present a new algorithm for the selection of informative frames in video action
recognition. Our approach is designed for aerial videos captured using a moving camera …

TSNet: Token Sparsification for Efficient Video Transformer

H Wang, W Zhang, G Liu - Applied Sciences, 2023 - mdpi.com
In the domain of video recognition, video transformers have demonstrated remarkable
performance, albeit at significant computational cost. This paper introduces TSNet, an …

Efficient video representation learning via motion-aware token selection

S Hwang, J Yoon, Y Lee, SJ Hwang - arXiv preprint arXiv:2211.10636, 2022 - arxiv.org
Recently emerged Masked Video Modeling techniques demonstrated their potential by
significantly outperforming previous methods in self-supervised learning for video. However …

PLAR: Prompt Learning for Action Recognition

X Wang, R Xian, T Guan, D Manocha - arXiv preprint arXiv:2305.12437, 2023 - arxiv.org
We present a new general learning approach, Prompt Learning for Action Recognition
(PLAR), which leverages the strengths of prompt learning to guide the learning process. Our …

Efficient video transformers via spatial-temporal token merging for action recognition

Z Feng, J Xu, L Ma, S Zhang - ACM Transactions on Multimedia …, 2024 - dl.acm.org
Transformer has exhibited promising performance in various video recognition tasks but
brings a huge computational cost in modeling spatial-temporal cues. This work aims to boost …

Skeletal keypoint-based transformer model for human action recognition in aerial videos

S Uddin, T Nawaz, J Ferryman, N Rashid… - IEEE …, 2024 - ieeexplore.ieee.org
Several efforts have been made to develop effective and robust vision-based solutions for
human action recognition in aerial videos. Generally, the existing methods rely on the …

Video action recognition based on spatio-temporal sampling

王冠, 彭梦昊, 陶应诚, 徐浩, 景圣恩 - Artificial Intelligence and …, 2024 - hanspub.org
Video features contain temporal and spatial redundancy arising during action execution. This information is unrelated to the action category, interferes with action recognition, and leads to misclassification of actions. This paper proposes a video action recognition model based on spatio-temporal sampling …