CL Zhang, J Wu, Y Li - European Conference on Computer Vision, 2022 - Springer
Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by …
J Yang, C Li, X Dai, J Gao - Advances in Neural Information …, 2022 - proceedings.neurips.cc
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation module for modeling token interactions in vision …
Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability of capturing short-and long-range visual dependencies …
Abstract Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability to capture local and global visual dependencies through …
Abstract Convolutional Neural Networks (CNNs) have dominated computer vision for years, due to its ability in capturing locality and translation invariance. Recently, many vision …
The task of action detection aims at deducing both the action category and localization of the start and end moment for each action instance in a long, untrimmed video. While vision …
J Liang, H Zhu, E Zhang… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
Distracted driver actions can be dangerous and cause severe accidents. Thus, it is important to detect and eliminate distracted driving behaviors on the road to save lives. To this end, we …
TN Tang, K Kim, K Sohn - arXiv preprint arXiv:2303.09055, 2023 - arxiv.org
Temporal Action Localization (TAL) is a challenging task in video understanding that aims to identify and localize actions within a video sequence. Recent studies have emphasized the …
We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video by exploiting multi-scale temporal features. In existing …