Learning imbalanced data with vision transformers

Z Xu, R Liu, S Yang, Z Chai… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Real-world data tends to be heavily imbalanced and severely skews data-driven deep
neural networks, which makes Long-Tailed Recognition (LTR) a massively challenging task …

Ti-MAE: Self-supervised masked time series autoencoders

Z Li, Z Rao, L Pan, P Wang, Z Xu - arXiv preprint arXiv:2301.08871, 2023 - arxiv.org
Multivariate Time Series forecasting has been an increasingly popular topic in various
applications and scenarios. Recently, contrastive learning and Transformer-based models …

Curriculum learning with infant egocentric videos

S Sheybani, H Hansaria, J Wood… - Advances in Neural …, 2024 - proceedings.neurips.cc
Infants possess a remarkable ability to rapidly learn and process visual inputs. As an infant's
mobility increases, so does the variety and dynamics of their visual inputs. Is this change in …

Interacting-enhancing feature transformer for cross-modal remote-sensing image and text retrieval

X Tang, Y Wang, J Ma, X Zhang, F Liu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Cross-modal remote-sensing image–text retrieval (CMRSITR) is a challenging topic in the
remote-sensing (RS) community. It has gained growing attention because it can be flexibly …

1% vs 100%: Parameter-efficient low rank adapter for dense predictions

D Yin, Y Yang, Z Wang, H Yu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Fine-tuning large-scale pre-trained vision models to downstream tasks is a standard
technique for achieving state-of-the-art performance on computer vision benchmarks …

TimeBalance: Temporally-invariant and temporally-distinctive video representations for semi-supervised action recognition

IR Dave, MN Rizve, C Chen… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Semi-Supervised Learning can be more beneficial for the video domain compared
to images because of its higher annotation cost and dimensionality. Besides, any video …

How can objects help action recognition?

X Zhou, A Arnab, C Sun… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Current state-of-the-art video models process a video clip as a long sequence of spatio-
temporal tokens. However, they do not explicitly model objects, their interactions across the …

Multi-Space Alignments Towards Universal LiDAR Segmentation

Y Liu, L Kong, X Wu, R Chen, X Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
A unified and versatile LiDAR segmentation model with strong robustness and
generalizability is desirable for safe autonomous driving perception. This work presents …

Does visual pretraining help end-to-end reasoning?

C Sun, C Luo, X Zhou, A Arnab… - Advances in Neural …, 2024 - proceedings.neurips.cc
We aim to investigate whether end-to-end learning of visual reasoning can be achieved with
general-purpose neural networks, with the help of visual pretraining. A positive result would …

Asymmetric masked distillation for pre-training small foundation models

Z Zhao, B Huang, S Xing, G Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Self-supervised foundation models have shown great potential in computer vision thanks to
the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the …