Rethinking video ViTs: Sparse video tubes for joint image and video learning

AJ Piergiovanni, W Kuo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We present a simple approach which can turn a ViT encoder into an efficient video model,
which can seamlessly work with both image and video inputs. By sparsely sampling the …

Prune spatio-temporal tokens by semantic-aware temporal accumulation

S Ding, P Zhao, X Zhang, R Qian… - Proceedings of the …, 2023 - openaccess.thecvf.com
Transformers have become the primary backbone of the computer vision community due to
their impressive performance. However, the unfriendly computation cost impedes their …

Efficient video action detection with token dropout and context refinement

L Chen, Z Tong, Y Song, G Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Streaming video clips with large-scale video tokens impede vision transformers (ViTs) for
efficient recognition, especially in video action detection where sufficient spatiotemporal …

PMI sampler: Patch similarity guided frame selection for aerial action recognition

R Xian, X Wang, D Kothandaraman… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present a new algorithm for the selection of informative frames in video action
recognition. Our approach is designed for aerial videos captured using a moving camera …

TSNet: Token Sparsification for Efficient Video Transformer

H Wang, W Zhang, G Liu - Applied Sciences, 2023 - mdpi.com
In the domain of video recognition, video transformers have demonstrated remarkable
performance, albeit at significant computational cost. This paper introduces TSNet, an …

Efficient video representation learning via motion-aware token selection

S Hwang, J Yoon, Y Lee, SJ Hwang - arXiv preprint arXiv:2211.10636, 2022 - arxiv.org
Recently emerged Masked Video Modeling techniques demonstrated their potential by
significantly outperforming previous methods in self-supervised learning for video. However …

PLAR: Prompt Learning for Action Recognition

X Wang, R Xian, T Guan, D Manocha - arXiv preprint arXiv:2305.12437, 2023 - arxiv.org
We present a new general learning approach, Prompt Learning for Action Recognition
(PLAR), which leverages the strengths of prompt learning to guide the learning process. Our …

Efficient video transformers via spatial-temporal token merging for action recognition

Z Feng, J Xu, L Ma, S Zhang - ACM Transactions on Multimedia …, 2024 - dl.acm.org
Transformer has exhibited promising performance in various video recognition tasks but
brings a huge computational cost in modeling spatial-temporal cues. This work aims to boost …

Skeletal keypoint-based transformer model for human action recognition in aerial videos

S Uddin, T Nawaz, J Ferryman, N Rashid… - IEEE …, 2024 - ieeexplore.ieee.org
Several efforts have been made to develop effective and robust vision-based solutions for
human action recognition in aerial videos. Generally, the existing methods rely on the …

Video action recognition based on spatio-temporal sampling

王冠, 彭梦昊, 陶应诚, 徐浩, 景圣恩 - Artificial Intelligence and …, 2024 - hanspub.org
Video features contain temporal and spatial redundancy arising during action execution. This information is unrelated to the action category, interferes with action recognition, and leads to misclassification of actions. This paper proposes a video action recognition model based on spatio-temporal sampling …