MViTv2: Improved multiscale vision transformers for classification and detection

Y Li, CY Wu, H Fan, K Mangalam… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …

UniFormer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
It is a challenging task to learn discriminative representations from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

Is space-time attention all you need for video understanding?

G Bertasius, H Wang, L Torresani - ICML, 2021 - proceedings.mlr.press
Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is
divided by 10 at epochs 11 and 14. During training, we first resize the shorter side of the …
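The step schedule described in this snippet maps directly onto PyTorch's MultiStepLR. The sketch below is a minimal illustration under assumptions (a placeholder model and a plain SGD optimizer), not the paper's released training code.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model standing in for the actual video transformer.
model = torch.nn.Linear(768, 400)

# Initial learning rate 0.005; MultiStepLR multiplies it by gamma=0.1
# (i.e., divides by 10) once the epoch counter passes each milestone,
# here epochs 11 and 14, matching the schedule quoted above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)
scheduler = MultiStepLR(optimizer, milestones=[11, 14], gamma=0.1)

for epoch in range(15):  # 15 training epochs, as in the snippet
    # ... one epoch of training would run here ...
    scheduler.step()
    print(epoch + 1, optimizer.param_groups[0]["lr"])
```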

MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition

CY Wu, Y Li, K Mangalam, H Fan… - Proceedings of the …, 2022 - openaccess.thecvf.com
While today's video recognition systems parse snapshots or short clips accurately, they
cannot yet connect the dots and reason across a longer range of time. Most existing video …

Video transformer network

D Neimark, O Bar, M Zohar… - Proceedings of the …, 2021 - openaccess.thecvf.com
This paper presents VTN, a transformer-based framework for video recognition. Inspired by
recent developments in vision transformers, we ditch the standard approach in video action …

Keeping your eye on the ball: Trajectory attention in video transformers

M Patrick, D Campbell, Y Asano… - Advances in neural …, 2021 - proceedings.neurips.cc
In video transformers, the time dimension is often treated in the same way as the two spatial
dimensions. However, in a scene where objects or the camera may move, a physical point …
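To make the snippet's contrast concrete, here is a hedged sketch (shapes and module choice are assumptions, not the paper's code) of the joint space-time attention baseline it alludes to: tokens from all frames are flattened into one sequence, so the time axis is attended over exactly like the two spatial axes.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: batch B, time T, height H, width W, embed dim D.
B, T, H, W, D = 2, 4, 7, 7, 64
tokens = torch.randn(B, T, H, W, D)

# Joint space-time attention: flatten (T, H, W) into a single token axis,
# so temporal positions are treated no differently than spatial ones.
seq = tokens.reshape(B, T * H * W, D)

attn = nn.MultiheadAttention(embed_dim=D, num_heads=8, batch_first=True)
out, _ = attn(seq, seq, seq)  # every token attends over all of space-time
print(out.shape)  # torch.Size([2, 196, 64])
```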

A large-scale study on unsupervised spatiotemporal representation learning

C Feichtenhofer, H Fan, B Xiong… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present a large-scale study on unsupervised spatiotemporal representation learning
from videos. With a unified perspective on four recent image-based frameworks, we study a …

A simple multi-modality transfer learning baseline for sign language translation

Y Chen, F Wei, X Sun, Z Wu… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
This paper proposes a simple transfer learning baseline for sign language translation.
Existing sign language datasets (e.g., PHOENIX-2014T, CSL-Daily) contain only about 10K …

Recurring the transformer for video action recognition

J Yang, X Dong, L Liu, C Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Existing video understanding approaches, such as 3D convolutional neural networks and
Transformer-based methods, usually process videos in a clip-wise manner. Hence, huge …