Attentional pooling for action recognition

MH Guo, TX Xu, JJ Liu, ZN Liu, PT Jiang, TJ Mu… - Computational visual …, 2022 - Springer

Humans can naturally and effectively find salient regions in complex scenes. Motivated by
this observation, attention mechanisms were introduced into computer vision with the aim of …

Cited by 1868 · Related articles · All 8 versions

[PDF] arxiv.org

Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE transactions on …, 2022 - ieeexplore.ieee.org

Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

Cited by 626 · Related articles · All 16 versions

[PDF] thecvf.com

Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models

Z Lin, S Yu, Z Kuang, D Pathak… - Proceedings of the …, 2023 - openaccess.thecvf.com

The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is
a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot …

Cited by 110 · Related articles · All 8 versions

[PDF] thecvf.com

ViViT: A video vision transformer

A Arnab, M Dehghani, G Heigold… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present pure-transformer based models for video classification, drawing upon the recent
success of such models in image classification. Our model extracts spatio-temporal tokens …

Cited by 2650 · Related articles · All 9 versions

[PDF] neurips.cc

VATT: Transformers for multimodal self-supervised learning from raw video, audio and text

H Akbari, L Yuan, R Qian… - Advances in …, 2021 - proceedings.neurips.cc

We present a framework for learning multimodal representations from unlabeled data using
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …

Cited by 691 · Related articles · All 9 versions

[PDF] thecvf.com

Anticipative video transformer

R Girdhar, K Grauman - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based
video modeling architecture that attends to the previously observed video in order to …

Cited by 246 · Related articles · All 6 versions

[PDF] thecvf.com

Video action transformer network

R Girdhar, J Carreira, C Doersch… - Proceedings of the …, 2019 - openaccess.thecvf.com

We introduce the Action Transformer model for recognizing and localizing human
actions in video clips. We repurpose a Transformer-style architecture to aggregate features …

Cited by 917 · Related articles · All 11 versions

[PDF] springer.com

Pooling in convolutional neural networks for medical image analysis: a survey and an empirical study

R Nirthika, S Manivannan, A Ramanan… - Neural Computing and …, 2022 - Springer

Convolutional neural networks (CNN) are widely used in computer vision and medical
image analysis as the state-of-the-art technique. In CNN, pooling layers are included mainly …

Cited by 164 · Related articles · All 12 versions

[PDF] researchgate.net

Attention, please! A survey of neural attention models in deep learning

A de Santana Correia, EL Colombini - Artificial Intelligence Review, 2022 - Springer

In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …

Cited by 223 · Related articles · All 8 versions

[PDF] neurips.cc

A^2-nets: Double attention networks

Y Chen, Y Kalantidis, J Li, S Yan… - Advances in neural …, 2018 - proceedings.neurips.cc

Learning to capture long-range relations is fundamental to image/video recognition. Existing
CNN models generally rely on increasing depth to model such relations which is highly …

Cited by 705 · Related articles · All 8 versions
