Video action transformer network

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org

Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Cited by 2213 · Related articles · All 8 versions

Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

Cited by 435 · Related articles · All 16 versions

Visual prompt tuning

M Jia, L Tang, BC Chen, C Cardie, S Belongie… - … on Computer Vision, 2022 - Springer

The current modus operandi in adapting pre-trained models involves updating all the
backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) …

Cited by 1094 · Related articles · All 7 versions

FLAVA: A foundational language and vision alignment model

A Singh, R Hu, V Goswami… - Proceedings of the …, 2022 - openaccess.thecvf.com

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …

Cited by 531 · Related articles · All 6 versions

MViTv2: Improved multiscale vision transformers for classification and detection

Y Li, CY Wu, H Fan, K Mangalam… - Proceedings of the …, 2022 - openaccess.thecvf.com

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …

Cited by 611 · Related articles · All 6 versions

ActionFormer: Localizing moments of actions with transformers

CL Zhang, J Wu, Y Li - European Conference on Computer Vision, 2022 - Springer

Self-attention based Transformer models have demonstrated impressive results for image
classification and object detection, and more recently for video understanding. Inspired by …

Cited by 283 · Related articles · All 7 versions

OccFormer: Dual-path transformer for vision-based 3D semantic occupancy prediction

Y Zhang, Z Zhu, D Du - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

The vision-based perception for autonomous driving has undergone a transformation from
the bird-eye-view (BEV) representations to the 3D semantic occupancy. Compared with the …

Cited by 74 · Related articles · All 5 versions

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

Cited by 1209 · Related articles · All 5 versions

ViViT: A video vision transformer

A Arnab, M Dehghani, G Heigold… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present pure-transformer based models for video classification, drawing upon the recent
success of such models in image classification. Our model extracts spatio-temporal tokens …

Cited by 2023 · Related articles · All 9 versions

Perceiver: General perception with iterative attention

A Jaegle, F Gimeno, A Brock… - International …, 2021 - proceedings.mlr.press

Biological systems understand the world by simultaneously processing high-dimensional
inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The …

Cited by 830 · Related articles · All 7 versions
