Delving into masked autoencoders for multi-label thorax disease classification

J Xiao, Y Bai, A Yuille, Z Zhou - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Vision Transformer (ViT) has become one of the most popular neural architectures
due to its simplicity, scalability, and compelling performance in multiple vision tasks …

MGMAE: Motion guided masking for video masked autoencoding

B Huang, Z Zhao, G Zhang, Y Qiao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Masked autoencoding has shown excellent performance on self-supervised video
representation learning. Temporal redundancy has led to a high masking ratio and …

MART: Masked affective representation learning via masked temporal distribution distillation

Z Zhang, P Zhao, E Park… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Limited training data is a long-standing problem for video emotion analysis (VEA). Existing
works leverage the power of large-scale image datasets for transferring while failing to …

Selective structured state-spaces for long-form video understanding

J Wang, W Zhu, P Wang, X Yu, L Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Effective modeling of complex spatiotemporal dependencies in long-form videos remains an
open problem. The recently proposed Structured State-Space Sequence (S4) model with its …

Audiovisual masked autoencoders

MI Georgescu, E Fonseca, RT Ionescu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Can we leverage the audiovisual information already present in video to improve
self-supervised representation learning? To answer this question, we study various pretraining …

Masked relation learning for deepfake detection

Z Yang, J Liang, Y Xu, XY Zhang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
DeepFake detection aims to differentiate falsified faces from real ones. Most approaches
formulate it as a binary classification problem by solely mining the local artifacts and …

Multi-view masked world models for visual robotic manipulation

Y Seo, J Kim, S James, K Lee… - … on Machine Learning, 2023 - proceedings.mlr.press
Visual robotic manipulation research and applications often use multiple cameras, or views,
to better perceive the world. How else can we utilize the richness of multi-view data? In this …

Segment any point cloud sequences by distilling vision foundation models

Y Liu, L Kong, J Cen, R Chen… - Advances in …, 2024 - proceedings.neurips.cc
Recent advancements in vision foundation models (VFMs) have opened up new
possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a …

On the benefits of 3D pose and tracking for human action recognition

J Rajasegaran, G Pavlakos… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work we study the benefits of using tracking and 3D poses for action recognition. To
achieve this, we take the Lagrangian view on analysing actions over a trajectory of human …

A survey on masked autoencoder for self-supervised learning in vision and beyond

C Zhang, C Zhang, J Song, JSK Yi, K Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
Masked autoencoders are scalable vision learners, as the title of MAE (He et al., 2022),
which suggests that self-supervised learning (SSL) in vision might undertake a similar …