Pyslowfast

MViTv2: Improved multiscale vision transformers for classification and detection

Y Li, CY Wu, H Fan, K Mangalam… - Proceedings of the …, 2022 - openaccess.thecvf.com

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …

Cited by 624, all 6 versions

UniFormer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org

It is a challenging task to learn discriminative representation from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …

Cited by 266, all 6 versions

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

Cited by 1228, all 5 versions

Is space-time attention all you need for video understanding?

G Bertasius, H Wang, L Torresani - ICML, 2021 - proceedings.mlr.press

Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is
divided by 10 at epochs 11 and 14. During training, we first resize the shorter side of the …

Cited by 1846, all 4 versions

MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition

CY Wu, Y Li, K Mangalam, H Fan… - Proceedings of the …, 2022 - openaccess.thecvf.com

While today's video recognition systems parse snapshots or short clips accurately, they
cannot connect the dots and reason across a longer range of time yet. Most existing video …

Cited by 175, all 5 versions

Video transformer network

D Neimark, O Bar, M Zohar… - Proceedings of the …, 2021 - openaccess.thecvf.com

This paper presents VTN, a transformer-based framework for video recognition. Inspired by
recent developments in vision transformers, we ditch the standard approach in video action …

Cited by 489, all 9 versions

Keeping your eye on the ball: Trajectory attention in video transformers

M Patrick, D Campbell, Y Asano… - Advances in neural …, 2021 - proceedings.neurips.cc

In video transformers, the time dimension is often treated in the same way as the two spatial
dimensions. However, in a scene where objects or the camera may move, a physical point …

Cited by 242, all 13 versions

A large-scale study on unsupervised spatiotemporal representation learning

C Feichtenhofer, H Fan, B Xiong… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present a large-scale study on unsupervised spatiotemporal representation learning
from videos. With a unified perspective on four recent image-based frameworks, we study a …

Cited by 264, all 6 versions

A simple multi-modality transfer learning baseline for sign language translation

Y Chen, F Wei, X Sun, Z Wu… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

This paper proposes a simple transfer learning baseline for sign language translation.
Existing sign language datasets (e.g., PHOENIX-2014T, CSL-Daily) contain only about 10K …

Cited by 107, all 5 versions

Recurring the transformer for video action recognition

J Yang, X Dong, L Liu, C Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com

Existing video understanding approaches, such as 3D convolutional neural networks and
Transformer-based methods, usually process the videos in a clip-wise manner. Hence huge …

Cited by 85, all 4 versions
