Masked autoencoders as spatiotemporal learners

J Xiao, Y Bai, A Yuille, Z Zhou - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Abstract Vision Transformer (ViT) has become one of the most popular neural architectures
due to its simplicity, scalability, and compelling performance in multiple vision tasks …

Cited by 50 · Related articles · All 5 versions

MGMAE: Motion guided masking for video masked autoencoding

B Huang, Z Zhao, G Zhang, Y Qiao… - Proceedings of the …, 2023 - openaccess.thecvf.com

Masked autoencoding has shown excellent performance on self-supervised video
representation learning. Temporal redundancy has led to a high masking ratio and …

Cited by 12 · Related articles · All 5 versions

MART: Masked affective representation learning via masked temporal distribution distillation

Z Zhang, P Zhao, E Park… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Limited training data is a long-standing problem for video emotion analysis (VEA). Existing
works leverage the power of large-scale image datasets for transferring while failing to …

Cited by 5 · Related articles

Selective structured state-spaces for long-form video understanding

J Wang, W Zhu, P Wang, X Yu, L Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Effective modeling of complex spatiotemporal dependencies in long-form videos remains an
open problem. The recently proposed Structured State-Space Sequence (S4) model with its …

Cited by 43 · Related articles · All 8 versions

Audiovisual masked autoencoders

MI Georgescu, E Fonseca, RT Ionescu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Can we leverage the audiovisual information already present in video to improve
self-supervised representation learning? To answer this question, we study various pretraining …

Cited by 32 · Related articles · All 5 versions

Masked relation learning for deepfake detection

Z Yang, J Liang, Y Xu, XY Zhang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

DeepFake detection aims to differentiate falsified faces from real ones. Most approaches
formulate it as a binary classification problem by solely mining the local artifacts and …

Cited by 40 · Related articles · All 2 versions

Multi-view masked world models for visual robotic manipulation

Y Seo, J Kim, S James, K Lee… - … on Machine Learning, 2023 - proceedings.mlr.press

Visual robotic manipulation research and applications often use multiple cameras, or views,
to better perceive the world. How else can we utilize the richness of multi-view data? In this …

Cited by 27 · Related articles · All 8 versions

Segment any point cloud sequences by distilling vision foundation models

Y Liu, L Kong, J Cen, R Chen… - Advances in …, 2024 - proceedings.neurips.cc

Recent advancements in vision foundation models (VFMs) have opened up new
possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a …

Cited by 18 · Related articles · All 6 versions

On the benefits of 3D pose and tracking for human action recognition

J Rajasegaran, G Pavlakos… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this work we study the benefits of using tracking and 3D poses for action recognition. To
achieve this, we take the Lagrangian view on analysing actions over a trajectory of human …

Cited by 22 · Related articles · All 5 versions

A survey on masked autoencoder for self-supervised learning in vision and beyond

C Zhang, C Zhang, J Song, JSK Yi, K Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org

Masked autoencoders are scalable vision learners, as the title of MAE \cite{he2022masked},
which suggests that self-supervised learning (SSL) in vision might undertake a similar …

Cited by 61 · Related articles · All 2 versions
