PointCMP: Contrastive mask prediction for self-supervised learning on point cloud videos

Z Shen, X Sheng, L Wang, Y Guo… - Proceedings of the …, 2023 - openaccess.thecvf.com
Self-supervised learning can extract high-quality representations from unlabeled data alone,
which is appealing for point cloud videos given their high labelling cost. In this paper …

Towards good practices for missing modality robust action recognition

S Woo, S Lee, Y Park, MA Nugroho… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Standard multi-modal models assume the use of the same modalities in training and
inference stages. However, in practice, the environment in which multi-modal models …

AST: Adaptive Self-supervised Transformer for optical remote sensing representation

Q He, X Sun, Z Yan, B Wang, Z Zhu, W Diao… - ISPRS Journal of …, 2023 - Elsevier
Due to the variation in spatial resolution and the diversity of object scales, the interpretation
of optical remote sensing images is extremely challenging. Deep learning has become the …

What can a cook in Italy teach a mechanic in India? Action Recognition Generalisation Over Scenarios and Locations

C Plizzari, T Perrett, B Caputo… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose and address a new generalisation problem: can a model trained for action
recognition successfully classify actions when they are performed within a previously …

Cross-view and Cross-pose Completion for 3D Human Understanding

M Armando, S Galaaoui, F Baradel… - Proceedings of the …, 2024 - openaccess.thecvf.com
Human perception and understanding is a major domain of computer vision which, like many
other vision subdomains, recently stands to gain from the use of large models pre-trained on …

Masked spatio-temporal structure prediction for self-supervised learning on point cloud videos

Z Shen, X Sheng, H Fan, L Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently, the community has made tremendous progress in developing effective methods
for point cloud video understanding that learn from massive amounts of labeled data …

SimVTP: Simple video text pre-training with masked autoencoders

Y Ma, T Yang, Y Shan, X Li - arXiv preprint arXiv:2212.03490, 2022 - arxiv.org
This paper presents SimVTP: a Simple Video-Text Pretraining framework via masked
autoencoders. We randomly mask out the spatial-temporal tubes of input video and the word …
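The snippet mentions randomly masking spatio-temporal tubes of the input video. As a rough illustration only (not SimVTP's actual implementation), the sketch below shows the kind of tube masking commonly used in video masked autoencoders: the same randomly chosen spatial patches are hidden across every frame. The function name, grid sizes, and mask ratio are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of random spatio-temporal tube masking:
# the same spatial patches are hidden in every frame, so each masked "tube" spans time.
import numpy as np

def random_tube_mask(num_frames, h_patches, w_patches, mask_ratio=0.9, seed=None):
    """Return a boolean mask of shape (num_frames, h_patches, w_patches);
    True marks patches removed before the encoder."""
    rng = np.random.default_rng(seed)
    num_spatial = h_patches * w_patches
    num_masked = int(round(mask_ratio * num_spatial))
    # Choose which spatial positions to hide, identically for all frames.
    masked_positions = rng.choice(num_spatial, size=num_masked, replace=False)
    spatial_mask = np.zeros(num_spatial, dtype=bool)
    spatial_mask[masked_positions] = True
    spatial_mask = spatial_mask.reshape(h_patches, w_patches)
    # Broadcast the same spatial pattern over the temporal axis -> tubes.
    return np.broadcast_to(spatial_mask, (num_frames, h_patches, w_patches)).copy()

# Example: a 16-frame clip split into a 14x14 patch grid, 90% of tubes masked.
mask = random_tube_mask(num_frames=16, h_patches=14, w_patches=14, mask_ratio=0.9)
print(mask.shape, mask.mean())  # (16, 14, 14), ~0.9
```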

SMAUG: Sparse masked autoencoder for efficient video-language pre-training

Y Lin, C Wei, H Wang, A Yuille… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Video-language pre-training is crucial for learning powerful multi-modal representations.
However, it typically requires a massive amount of computation. In this paper, we develop …

Long-range multimodal pretraining for movie understanding

DM Argaw, JY Lee, M Woodson… - Proceedings of the …, 2023 - openaccess.thecvf.com
Learning computer vision models from (and for) movies has a long-standing history. While
great progress has been attained, there is still a need for a pretrained multimodal model that …

HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition

L Sun, Z Lian, B Liu, J Tao - Information Fusion, 2024 - Elsevier
Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in
recent years for its critical role in creating emotion-aware intelligent machines. Previous …