Masked autoencoding has shown excellent performance on self-supervised video representation learning. Temporal redundancy has led to a high masking ratio and …
Z Zhang, P Zhao, E Park… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Limited training data is a long-standing problem for video emotion analysis (VEA). Existing works leverage the power of large-scale image datasets for transferring while failing to …
J Wang, W Zhu, P Wang, X Yu, L Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Effective modeling of complex spatiotemporal dependencies in long-form videos remains an open problem. The recently proposed Structured State-Space Sequence (S4) model with its …
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining …
Z Yang, J Liang, Y Xu, XY Zhang… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
DeepFake detection aims to differentiate falsified faces from real ones. Most approaches formulate it as a binary classification problem by solely mining the local artifacts and …
Visual robotic manipulation research and applications often use multiple cameras, or views, to better perceive the world. How else can we utilize the richness of multi-view data? In this …
Recent advancements in vision foundation models (VFMs) have opened up new possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a …
In this work we study the benefits of using tracking and 3D poses for action recognition. To achieve this, we take the Lagrangian view on analysing actions over a trajectory of human …
Masked autoencoders are scalable vision learners, as the title of MAE \cite{he2022masked}, which suggests that self-supervised learning (SSL) in vision might undertake a similar …