In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version …
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models …
J Yang, X Dong, L Liu, C Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Existing video understanding approaches, such as 3D convolutional neural networks and Transformer-based methods, usually process videos in a clip-wise manner. Hence huge …
A large proportion of construction accidents are caused by unintentional and unsafe actions and behaviors. It is difficult and ineffective to monitor unsafe behaviors …
Micro-videos have recently gained immense popularity, sparking critical research in micro-video recommendation with significant implications for the entertainment, advertising, and e …
We introduce AugLy, a data augmentation library with a focus on adversarial robustness. AugLy provides a wide array of augmentations for multiple modalities (audio, image, text, & …
We introduce the task of spotting temporally precise, fine-grained events in video (detecting the precise moment in time events occur). Precise spotting requires models to reason …
Machine learning models often fail to generalize well under distributional shifts. Understanding and overcoming these failures have led to a research field of Out-of …
CH Kung, SW Lu, YH Tsai… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In this paper, we study multi-label atomic activity recognition. Despite the notable progress in action recognition, it is still challenging to recognize atomic activities due to a deficiency in …