Real-world robot learning with masked visual pre-training

I Radosavovic, T Xiao, S James… - … on Robot Learning, 2023 - proceedings.mlr.press
In this work, we explore self-supervised visual pre-training on images from diverse, in-the-
wild videos for real-world robotic tasks. Like prior work, our visual representations are pre …

InternVideo: General video foundation models via generative and discriminative learning

Y Wang, K Li, Y Li, Y He, B Huang, Z Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
Foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

Unmasked teacher: Towards training-efficient video foundation models

K Li, Y Wang, Y Li, Y Wang, Y He… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …

Hiera: A hierarchical vision transformer without the bells-and-whistles

C Ryali, YT Hu, D Bolya, C Wei, H Fan… - International …, 2023 - proceedings.mlr.press
Modern hierarchical vision transformers have added several vision-specific components in
the pursuit of supervised classification performance. While these components lead to …

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

DropMAE: Masked autoencoders with spatial-attention dropout for tracking tasks

Q Wu, T Yang, Z Liu, B Wu, Y Shan… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-
based downstream tasks, including visual object tracking (VOT) and video object …

Recurrent vision transformers for object detection with event cameras

M Gehrig, D Scaramuzza - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
We present Recurrent Vision Transformers (RVTs), a novel backbone for object
detection with event cameras. Event cameras provide visual information with sub …

Masked world models for visual control

Y Seo, D Hafner, H Liu, F Liu, S James… - … on Robot Learning, 2023 - proceedings.mlr.press
Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient
robot learning from visual observations. Yet the current approaches typically train a single …

SVFormer: Semi-supervised video transformer for action recognition

Z Xing, Q Dai, H Hu, J Chen, Z Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Semi-supervised action recognition is a challenging but critical task due to the high cost of
video annotations. Existing approaches mainly use convolutional neural networks, yet …