Self-supervised learning by cross-modal audio-video clustering

L Ericsson, H Gouk, CC Loy… - IEEE Signal Processing …, 2022 - ieeexplore.ieee.org

Self-supervised representation learning (SSRL) methods aim to provide powerful, deep
feature learning without the requirement of large annotated data sets, thus alleviating the …

Cited by 258 · Related articles · All 7 versions

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org

The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

Cited by 96 · Related articles · All 4 versions

VideoMAE V2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com

Scale is the primary factor for building a powerful foundation model that could well
generalize to a variety of downstream tasks. However, it is still challenging to train video …

Cited by 178 · Related articles · All 7 versions

VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc

Pre-training video transformers on extra large-scale datasets is generally required to
achieve premier performance on relatively small datasets. In this paper, we show that video …

Cited by 698 · Related articles · All 6 versions

A cookbook of self-supervised learning

R Balestriero, M Ibrahim, V Sobal, A Morcos… - arXiv preprint arXiv …, 2023 - arxiv.org

Self-supervised learning, dubbed the dark matter of intelligence, is a promising path to
advance machine learning. Yet, much like cooking, training SSL methods is a delicate art …

Cited by 204 · Related articles · All 5 versions

VATT: Transformers for multimodal self-supervised learning from raw video, audio and text

H Akbari, L Yuan, R Qian… - Advances in …, 2021 - proceedings.neurips.cc

We present a framework for learning multimodal representations from unlabeled data using
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …

Cited by 556 · Related articles · All 8 versions

Hard negative mixing for contrastive learning

Y Kalantidis, MB Sariyildiz, N Pion… - Advances in neural …, 2020 - proceedings.neurips.cc

Contrastive learning has become a key component of self-supervised learning approaches
for computer vision. By learning to embed two augmented versions of the same image close …

Cited by 588 · Related articles · All 7 versions

A large-scale study on unsupervised spatiotemporal representation learning

C Feichtenhofer, H Fan, B Xiong… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present a large-scale study on unsupervised spatiotemporal representation learning
from videos. With a unified perspective on four recent image-based frameworks, we study a …

Cited by 259 · Related articles · All 7 versions

Learning audio-visual speech representation by masked multimodal cluster prediction

B Shi, WN Hsu, K Lakhotia, A Mohamed - arXiv preprint arXiv:2201.02184, 2022 - arxiv.org

Video recordings of speech contain correlated audio and visual information, providing a
strong signal for speech representation learning from the speaker's lip movements and the …

Cited by 219 · Related articles · All 4 versions

Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org

Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

Cited by 413 · Related articles · All 16 versions
