Recent breakthroughs in natural language processing, driven by pretraining models on large quantities of data, have opened the way for similar foundation models in computer vision …
We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that all combinations …
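The core mechanism this snippet points at is aligning every modality to images with a contrastive objective. A minimal, hypothetical PyTorch sketch of such a symmetric InfoNCE loss is shown below; the function name and temperature are assumptions, not ImageBind's actual code:

```python
import torch
import torch.nn.functional as F

def info_nce(image_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE between image embeddings and another modality's
    embeddings, both of shape [batch, dim]; paired samples share an index."""
    image_emb = F.normalize(image_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = image_emb @ other_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```

Applying this loss between images and each of the other five modalities in turn is enough to place all six in one embedding space, since images act as the shared anchor.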
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using …
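For readers unfamiliar with the paradigm being revisited, the standard recipe looks roughly like the sketch below, here using an off-the-shelf torchvision backbone as a stand-in for a foundation model; the class count and learning rate are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (stand-in for a foundation model).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Swap the classification head for the downstream task.
num_classes = 10  # hypothetical downstream label count
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Finetune all weights at a small learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```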
Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …
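In the frozen-ViT setting, the pretrained backbone receives no gradient updates and only a small head is trained. A minimal sketch under the assumption of a torchvision ViT with a linear probe (the class count is hypothetical):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a pretrained ViT and freeze every backbone parameter.
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
for p in vit.parameters():
    p.requires_grad_(False)

# Only this freshly initialized head is trained ("frozen ViT" probing).
vit.heads = nn.Linear(vit.hidden_dim, 10)  # 10 is a hypothetical class count
optimizer = torch.optim.AdamW(vit.heads.parameters(), lr=1e-3)
```

Contrasted with the full-finetuning sketch above, this trains orders of magnitude fewer parameters, which is what makes frozen backbones attractive when labels are scarce.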
Benefiting from masked visual modeling, self-supervised video representation learning has achieved remarkable progress. However, existing methods focus on learning …
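Masked visual modeling for video commonly uses "tube" masking, where the same random spatial patches are dropped in every frame so the model cannot copy content across time. The helper below is a hypothetical sketch of that idea, not code from any of these papers:

```python
import torch

def tube_mask(tokens, mask_ratio=0.9):
    """Tube masking: drop the same random spatial patches in every frame.
    tokens: [batch, frames, patches, dim]; returns only the visible tokens."""
    b, t, n, d = tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    # Independent random scores per sample; keep the top-scoring patches.
    scores = torch.rand(b, n, device=tokens.device)
    keep = scores.argsort(dim=-1)[:, :num_keep]            # [b, num_keep]
    keep = keep[:, None, :, None].expand(b, t, num_keep, d)
    return torch.gather(tokens, dim=2, index=keep)
```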
In this paper, we first extend the recent Masked Autoencoder (MAE) model from a single modality to the audio-visual multimodal setting. Subsequently, we propose the Contrastive Audio …
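A rough sketch of how a masked-reconstruction objective can be combined with a contrastive audio-visual term follows; the function name, loss weighting, and interface are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def cav_mae_loss(pred_a, target_a, pred_v, target_v, emb_a, emb_v,
                 temperature=0.07, contrast_weight=0.01):
    """Hypothetical combined objective: MAE reconstruction on the masked
    audio and video patches plus a contrastive term that pulls paired
    audio/video clip embeddings together."""
    recon = F.mse_loss(pred_a, target_a) + F.mse_loss(pred_v, target_v)
    a = F.normalize(emb_a, dim=-1)
    v = F.normalize(emb_v, dim=-1)
    logits = a @ v.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrast = F.cross_entropy(logits, targets)
    return recon + contrast_weight * contrast
```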
Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of …
Limited training data is a long-standing problem for video emotion analysis (VEA). Existing works leverage the power of large-scale image datasets for transfer learning, while failing to …
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining …