X Li, K Yuan, Y Pei, Y Lu, M Sun… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper reviews the NTIRE 2024 Challenge on Short-form UGC Video Quality Assessment (S-UGC VQA), where various excellent solutions were submitted and evaluated …
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …
Scale is the primary factor in building a powerful foundation model that generalizes well to a variety of downstream tasks. However, it is still challenging to train video …
Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A natural follow-up scenario is to adapt a ViT to various image and video recognition …
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime …
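The snippet above describes extending Masked Autoencoders to video by randomly masking spacetime patches. A minimal sketch of that masking step is shown below; the function name, patch counts, and 90% mask ratio are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def random_spacetime_mask(num_frames, patches_per_frame, mask_ratio=0.9, seed=0):
    """Randomly split spacetime patch indices into visible and masked sets.

    Treats the video as a flat sequence of num_frames * patches_per_frame
    tokens and samples a uniform random subset to mask (hypothetical sketch).
    """
    rng = np.random.default_rng(seed)
    total = num_frames * patches_per_frame
    num_masked = int(total * mask_ratio)
    perm = rng.permutation(total)
    masked = np.sort(perm[:num_masked])     # reconstructed by the decoder
    visible = np.sort(perm[num_masked:])    # the only tokens the encoder sees
    return visible, masked

# Example: 8 frames, 14x14 = 196 patches each, 90% masked.
visible, masked = random_spacetime_mask(8, 196, mask_ratio=0.9)
```

Because the encoder runs only on the small visible subset, a high mask ratio makes spatiotemporal pretraining much cheaper than processing every token.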
This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike recent visual transformers that introduce vision-specific inductive biases into their …
Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video …
Y Li, G Yuan, Y Wen, J Hu… - Advances in …, 2022 - proceedings.neurips.cc
Vision Transformers (ViTs) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of …
J Xu, S De Mello, S Liu, W Byeon… - Proceedings of the …, 2022 - openaccess.thecvf.com
Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems …