The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models …
Neural compression is the application of neural networks and other machine learning methods to data compression. Recent advances in statistical machine learning have opened …
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to …
The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires …
In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object …
M Gehrig, D Scaramuzza - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras. Event cameras provide visual information with sub …
Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet the current approaches typically train a single …
Z Xing, Q Dai, H Hu, J Chen, Z Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotations. Existing approaches mainly use convolutional neural networks, yet …