PointCMP: Contrastive mask prediction for self-supervised learning on point cloud videos

Z Shen, X Sheng, L Wang, Y Guo… - Proceedings of the …, 2023 - openaccess.thecvf.com
Self-supervised learning can extract high-quality representations from unlabeled data alone,
which is appealing for point cloud videos given their high labelling cost. In this paper …

Towards good practices for missing modality robust action recognition

S Woo, S Lee, Y Park, MA Nugroho… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Standard multi-modal models assume the use of the same modalities in training and
inference stages. However, in practice, the environment in which multi-modal models …

AST: Adaptive Self-supervised Transformer for optical remote sensing representation

Q He, X Sun, Z Yan, B Wang, Z Zhu, W Diao… - ISPRS Journal of …, 2023 - Elsevier
Due to the variation in spatial resolution and the diversity of object scales, the interpretation
of optical remote sensing images is extremely challenging. Deep learning has become the …

What can a cook in Italy teach a mechanic in India? Action Recognition Generalisation Over Scenarios and Locations

C Plizzari, T Perrett, B Caputo… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose and address a new generalisation problem: can a model trained for action
recognition successfully classify actions when they are performed within a previously …

Cross-view and Cross-pose Completion for 3D Human Understanding

M Armando, S Galaaoui, F Baradel… - Proceedings of the …, 2024 - openaccess.thecvf.com
Human perception and understanding is a major domain of computer vision which, like many
other vision subdomains, recently stands to gain from the use of large models pre-trained on …

Masked spatio-temporal structure prediction for self-supervised learning on point cloud videos

Z Shen, X Sheng, H Fan, L Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently, the community has made tremendous progress in developing effective methods
for point cloud video understanding that learn from massive amounts of labeled data …

SimVTP: Simple video text pre-training with masked autoencoders

Y Ma, T Yang, Y Shan, X Li - arXiv preprint arXiv:2212.03490, 2022 - arxiv.org
This paper presents SimVTP: a Simple Video-Text Pretraining framework via masked
autoencoders. We randomly mask out the spatial-temporal tubes of input video and the word …
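The snippet mentions randomly masking spatio-temporal tubes of the input video. As a rough illustration only (not SimVTP's actual implementation), the sketch below shows the kind of tube masking commonly used in video masked autoencoders: the same randomly chosen spatial patches are hidden across every frame. The function name, grid sizes, and mask ratio are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of random spatio-temporal tube masking:
# the same spatial patches are hidden in every frame, so each masked "tube" spans time.
import numpy as np

def random_tube_mask(num_frames, h_patches, w_patches, mask_ratio=0.9, seed=None):
    """Return a boolean mask of shape (num_frames, h_patches, w_patches);
    True marks patches removed before the encoder."""
    rng = np.random.default_rng(seed)
    num_spatial = h_patches * w_patches
    num_masked = int(round(mask_ratio * num_spatial))
    # Choose which spatial positions to hide, identically for all frames.
    masked_positions = rng.choice(num_spatial, size=num_masked, replace=False)
    spatial_mask = np.zeros(num_spatial, dtype=bool)
    spatial_mask[masked_positions] = True
    spatial_mask = spatial_mask.reshape(h_patches, w_patches)
    # Broadcast the same spatial pattern over the temporal axis -> tubes.
    return np.broadcast_to(spatial_mask, (num_frames, h_patches, w_patches)).copy()

# Example: a 16-frame clip split into a 14x14 patch grid, 90% of tubes masked.
mask = random_tube_mask(num_frames=16, h_patches=14, w_patches=14, mask_ratio=0.9)
print(mask.shape, mask.mean())  # (16, 14, 14), ~0.9
```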

SMAUG: Sparse masked autoencoder for efficient video-language pre-training

Y Lin, C Wei, H Wang, A Yuille… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Video-language pre-training is crucial for learning powerful multi-modal representations.
However, it typically requires a massive amount of computation. In this paper, we develop …

Long-range multimodal pretraining for movie understanding

DM Argaw, JY Lee, M Woodson… - Proceedings of the …, 2023 - openaccess.thecvf.com
Learning computer vision models from (and for) movies has a long-standing history. While
great progress has been attained, there is still a need for a pretrained multimodal model that …

HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

L Sun, Z Lian, B Liu, J Tao - Information Fusion, 2024 - Elsevier
Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in
recent years for its critical role in creating emotion-aware intelligent machines. Previous …