Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

ST-Adapter: Parameter-efficient image-to-video transfer learning

J Pan, Z Lin, X Zhu, J Shao, H Li - Advances in Neural …, 2022 - proceedings.neurips.cc
Capitalizing on large pre-trained models for various downstream tasks of interest has
recently emerged with promising performance. Due to the ever-growing model size, the …
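The abstract above is truncated. As a rough illustration of the parameter-efficient adapter idea the title refers to, below is a minimal PyTorch sketch of a spatio-temporal adapter attached to a frozen image backbone (down-projection, depthwise temporal convolution, up-projection). The module name, bottleneck width, and interface are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    """Illustrative spatio-temporal adapter: down-project, depthwise 3D conv
    over time, up-project. Only these few parameters would be trained while
    the pre-trained image backbone stays frozen."""
    def __init__(self, dim: int, bottleneck: int = 64, kernel_t: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.conv = nn.Conv3d(bottleneck, bottleneck,
                              kernel_size=(kernel_t, 1, 1),
                              padding=(kernel_t // 2, 0, 0),
                              groups=bottleneck)          # depthwise over time
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x, t, h, w):
        # x: (B, T*H*W, C) tokens from a frozen image transformer block
        b, n, c = x.shape
        z = self.down(x)                                    # (B, N, bottleneck)
        z = z.view(b, t, h, w, -1).permute(0, 4, 1, 2, 3)   # (B, C', T, H, W)
        z = self.conv(z)                                    # temporal mixing
        z = z.permute(0, 2, 3, 4, 1).reshape(b, n, -1)
        return x + self.up(self.act(z))                     # residual adapter

# e.g. tokens = STAdapter(dim=768)(tokens, t=8, h=14, w=14)
```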

UniFormer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
It is a challenging task to learn discriminative representations from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …
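As a sketch of the convolution-plus-self-attention hybrid the title describes, here is a toy block that uses cheap depthwise convolution for local token mixing in shallow stages and multi-head self-attention for global mixing in deep stages. This is a generic illustration under assumed dimensions, not the official UniFormer block.

```python
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    """Toy hybrid block: depthwise conv handles local redundancy cheaply,
    self-attention handles global dependency; a channel MLP follows both."""
    def __init__(self, dim: int, use_attention: bool, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        if use_attention:
            self.mixer = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.local = None
        else:
            self.mixer = None
            self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, N, C)
        h = self.norm1(x)
        if self.mixer is not None:
            h, _ = self.mixer(h, h, h)         # global relation modelling
        else:
            h = self.local(h.transpose(1, 2)).transpose(1, 2)  # local mixing
        x = x + h
        return x + self.mlp(self.norm2(x))

# shallow stage: LocalGlobalBlock(96, use_attention=False)
# deep stage:    LocalGlobalBlock(384, use_attention=True)
```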

Prompting visual-language models for efficient video understanding

C Ju, T Han, K Zheng, Y Zhang, W Xie - European Conference on …, 2022 - Springer
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …
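To illustrate the prompting idea the title refers to, below is a minimal sketch of text-side prompt tuning for a frozen image-based visual-language model: a handful of learnable context embeddings are prepended to class-name token embeddings, and only those context vectors are optimized. The class count, context length, and placeholder embeddings are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class TextPromptLearner(nn.Module):
    """Generic prompt tuning: learnable 'context' embeddings are prepended to
    class-name token embeddings; the visual-language model itself stays frozen."""
    def __init__(self, embed_dim: int = 512, n_ctx: int = 16, n_classes: int = 400):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # frozen class-name embeddings would come from the model's tokenizer and
        # token-embedding table; random buffers stand in for them here
        self.register_buffer("cls_tokens", torch.randn(n_classes, 8, embed_dim))

    def forward(self):
        n_classes = self.cls_tokens.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)  # (K, n_ctx, D)
        # (K, n_ctx + 8, D): prompt sequences fed to the frozen text encoder
        return torch.cat([ctx, self.cls_tokens], dim=1)
```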

EdgeViTs: Competing light-weight CNNs on mobile devices with vision transformers

J Pan, A Bulat, F Tan, X Zhu, L Dudziak, H Li… - … on Computer Vision, 2022 - Springer
Self-attention based models such as vision transformers (ViTs) have emerged as a highly
competitive architectural alternative to convolutional neural networks (CNNs) in computer …

Transformer meets remote sensing video detection and tracking: A comprehensive survey

L Jiao, X Zhang, X Liu, F Liu, S Yang… - IEEE Journal of …, 2023 - ieeexplore.ieee.org
Transformers have shown excellent performance in the remote sensing field thanks to their
long-range modeling capabilities. Remote sensing video (RSV) moving object detection and tracking …

TS2-Net: Token shift and selection transformer for text-video retrieval

Y Liu, P Xiong, L Xu, S Cao, Q Jin - European Conference on Computer …, 2022 - Springer
Text-video retrieval is a task of great practical value and has received increasing attention;
learning spatial-temporal video representations is one of its research hotspots …
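As a rough illustration of the temporal shift idea behind the "token shift" in the title, the sketch below moves a fraction of token channels one step forward and backward along the time axis so that each frame's tokens mix information from neighbouring frames. The shift ratio and channel-level formulation are assumptions; the paper's token-level variant may differ.

```python
import torch

def temporal_token_shift(x: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift a fraction of channels one step forward/backward along time.
    x: (B, T, N, C) video token features."""
    b, t, n, c = x.shape
    k = int(c * shift_ratio) // 2
    out = x.clone()
    out[:, 1:, :, :k] = x[:, :-1, :, :k]            # shift forward in time
    out[:, :-1, :, k:2 * k] = x[:, 1:, :, k:2 * k]  # shift backward in time
    return out

# tokens = temporal_token_shift(tokens)  # then feed to a per-frame transformer
```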

Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning

R Wang, D Chen, Z Wu, Y Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Benefiting from masked visual modeling, self-supervised video representation learning has
achieved remarkable progress. However, existing methods focus on learning …
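To make the masked visual modeling setup mentioned above concrete, here is a minimal sketch of random token masking for a video: most tokens are hidden, only the visible ones are encoded, and the model regresses features of the masked ones. The mask ratio and token layout are assumptions; the paper's distillation targets are not reproduced here.

```python
import torch

def random_token_mask(num_tokens: int, mask_ratio: float = 0.9,
                      device: str = "cpu") -> torch.Tensor:
    """Return a boolean mask over video tokens (True = masked)."""
    num_masked = int(num_tokens * mask_ratio)
    noise = torch.rand(num_tokens, device=device)
    ids = noise.argsort()                     # random permutation of token indices
    mask = torch.zeros(num_tokens, dtype=torch.bool, device=device)
    mask[ids[:num_masked]] = True
    return mask

# example: 8 frames x 14 x 14 patches = 1568 tokens, 90% masked
mask = random_token_mask(8 * 14 * 14, mask_ratio=0.9)
```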

PhysFormer: Facial video-based physiological measurement with temporal difference transformer

Z Yu, Y Shen, J Shi, H Zhao… - Proceedings of the …, 2022 - openaccess.thecvf.com
Remote photoplethysmography (rPPG), which aims at measuring heart activities and
physiological signals from facial video without any contact, has great potential in many …
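As a sketch of the kind of temporal-difference operator the title's "temporal difference transformer" suggests, the module below mixes a standard temporal convolution with a convolution over frame-to-frame differences, which emphasizes subtle periodic changes such as pulse-induced color variation. The mixing weight and placement are assumptions, not the paper's exact operator.

```python
import torch
import torch.nn as nn

class TemporalDifferenceConv(nn.Module):
    """Illustrative temporal-difference convolution over video features."""
    def __init__(self, channels: int, theta: float = 0.7):
        super().__init__()
        self.theta = theta   # weight of the difference branch (assumed value)
        self.conv = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                              padding=(1, 0, 0))

    def forward(self, x):                             # x: (B, C, T, H, W)
        out = self.conv(x)
        diff = torch.zeros_like(x)
        diff[:, :, 1:] = x[:, :, 1:] - x[:, :, :-1]   # frame differences
        return out + self.theta * self.conv(diff)
```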

Adaptive token sampling for efficient vision transformers

M Fayyaz, SA Koohpayegani, FR Jafari… - … on Computer Vision, 2022 - Springer
While state-of-the-art vision transformer models achieve promising results in image
classification, they are computationally expensive and require many GFLOPs. Although the …
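To illustrate why token sampling reduces the GFLOPs mentioned above, here is a simplified stand-in that keeps the class token plus the highest-scoring patch tokens and discards the rest before later transformer layers. The paper's sampling is adaptive and differentiable; this top-k version and the scoring interface are assumptions.

```python
import torch

def prune_tokens(x: torch.Tensor, scores: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the class token plus the `keep` highest-scoring patch tokens.
    x: (B, 1 + N, C) with the class token first; scores: (B, N)."""
    cls_tok, patches = x[:, :1], x[:, 1:]
    idx = scores.topk(keep, dim=1).indices                # (B, keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])   # (B, keep, C)
    kept = patches.gather(1, idx)
    return torch.cat([cls_tok, kept], dim=1)              # (B, 1 + keep, C)

# scores could be, e.g., attention weights from the class token to each patch
```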