Masked autoencoders as spatiotemporal learners

Z Xu, R Liu, S Yang, Z Chai… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Real-world data tends to be heavily imbalanced and severely skews data-driven deep
neural networks, which makes Long-Tailed Recognition (LTR) a massively challenging task …

Cited by 20 Related articles All 7 versions

[PDF] arxiv.org

Ti-mae: Self-supervised masked time series autoencoders

Z Li, Z Rao, L Pan, P Wang, Z Xu - arXiv preprint arXiv:2301.08871, 2023 - arxiv.org

Multivariate Time Series forecasting has been an increasingly popular topic in various
applications and scenarios. Recently, contrastive learning and Transformer-based models …

Cited by 30 Related articles All 3 versions

[PDF] neurips.cc

Curriculum learning with infant egocentric videos

S Sheybani, H Hansaria, J Wood… - Advances in Neural …, 2024 - proceedings.neurips.cc

Infants possess a remarkable ability to rapidly learn and process visual inputs. As an infant's
mobility increases, so does the variety and dynamics of their visual inputs. Is this change in …

Cited by 6 Related articles All 4 versions

Interacting-enhancing feature transformer for cross-modal remote-sensing image and text retrieval

X Tang, Y Wang, J Ma, X Zhang, F Liu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Cross-modal remote-sensing image–text retrieval (CMRSITR) is a challenging topic in the
remote-sensing (RS) community. It has gained growing attention because it can be flexibly …

Cited by 18 Related articles All 2 versions

[PDF] thecvf.com

1% vs 100%: Parameter-efficient low rank adapter for dense predictions

D Yin, Y Yang, Z Wang, H Yu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Fine-tuning large-scale pre-trained vision models to downstream tasks is a standard
technique for achieving state-of-the-art performance on computer vision benchmarks …

Cited by 17 Related articles All 3 versions

[PDF] thecvf.com

Timebalance: Temporally-invariant and temporally-distinctive video representations for semi-supervised action recognition

IR Dave, MN Rizve, C Chen… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Semi-Supervised Learning can be more beneficial for the video domain compared
to images because of its higher annotation cost and dimensionality. Besides, any video …

Cited by 13 Related articles All 6 versions

[PDF] thecvf.com

How can objects help action recognition?

X Zhou, A Arnab, C Sun… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

Current state-of-the-art video models process a video clip as a long sequence of spatio-
temporal tokens. However, they do not explicitly model objects, their interactions across the …

Cited by 8 Related articles All 7 versions

[PDF] thecvf.com

Multi-Space Alignments Towards Universal LiDAR Segmentation

Y Liu, L Kong, X Wu, R Chen, X Li… - Proceedings of the …, 2024 - openaccess.thecvf.com

A unified and versatile LiDAR segmentation model with strong robustness and
generalizability is desirable for safe autonomous driving perception. This work presents …

Cited by 4 Related articles All 6 versions

[PDF] neurips.cc

Does visual pretraining help end-to-end reasoning?

C Sun, C Luo, X Zhou, A Arnab… - Advances in Neural …, 2024 - proceedings.neurips.cc

We aim to investigate whether end-to-end learning of visual reasoning can be achieved with
general-purpose neural networks, with the help of visual pretraining. A positive result would …

Cited by 2 Related articles All 5 versions

[PDF] thecvf.com

Asymmetric masked distillation for pre-training small foundation models

Z Zhao, B Huang, S Xing, G Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Self-supervised foundation models have shown great potential in computer vision thanks to
the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the …

Cited by 3 Related articles All 3 versions
