Real-world robot learning with masked visual pre-training

I Radosavovic, T Xiao, S James… - … on Robot Learning, 2023 - proceedings.mlr.press
In this work, we explore self-supervised visual pre-training on images from diverse, in-the-
wild videos for real-world robotic tasks. Like prior work, our visual representations are pre …

InternVideo: General video foundation models via generative and discriminative learning

Y Wang, K Li, Y Li, Y He, B Huang, Z Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
Foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

Unmasked teacher: Towards training-efficient video foundation models

K Li, Y Wang, Y Li, Y Wang, Y He… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …

Hiera: A hierarchical vision transformer without the bells-and-whistles

C Ryali, YT Hu, D Bolya, C Wei, H Fan… - International …, 2023 - proceedings.mlr.press
Modern hierarchical vision transformers have added several vision-specific components in
the pursuit of supervised classification performance. While these components lead to …

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

DropMAE: Masked autoencoders with spatial-attention dropout for tracking tasks

Q Wu, T Yang, Z Liu, B Wu, Y Shan… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-
based downstream tasks, including visual object tracking (VOT) and video object …

Recurrent vision transformers for object detection with event cameras

M Gehrig, D Scaramuzza - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
We present Recurrent Vision Transformers (RVTs), a novel backbone for object
detection with event cameras. Event cameras provide visual information with sub …

Masked world models for visual control

Y Seo, D Hafner, H Liu, F Liu, S James… - … on Robot Learning, 2023 - proceedings.mlr.press
Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient
robot learning from visual observations. Yet the current approaches typically train a single …

SVFormer: Semi-supervised video transformer for action recognition

Z Xing, Q Dai, H Hu, J Chen, Z Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Semi-supervised action recognition is a challenging but critical task due to the high cost of
video annotations. Existing approaches mainly use convolutional neural networks, yet …