Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders

R Zhang, L Wang, Y Qiao, P Gao… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Pre-training on abundant image data has become the de facto standard for robust 2D
representations. In contrast, due to expensive data processing, a paucity of 3D datasets severely hinders …

Context autoencoder for self-supervised representation learning

X Chen, M Ding, X Wang, Y Xin, S Mo, Y Wang… - International Journal of …, 2024 - Springer
We present a novel masked image modeling (MIM) approach, context autoencoder (CAE),
for self-supervised representation pretraining. We pretrain an encoder by making predictions …
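
(For orientation: a minimal sketch of the masked-prediction pretraining loop that CAE-style MIM methods build on. Everything below is illustrative, not the paper's code; the pooled-context predictor in particular is a crude stand-in for CAE's latent regressor.)

    # Minimal masked-image-modeling sketch (illustrative; not CAE's code).
    # Patchified tokens are split into visible and hidden sets; the encoder
    # sees only the visible ones, and a predictor regresses the hidden ones.
    import torch
    import torch.nn as nn

    class TinyMIM(nn.Module):
        def __init__(self, num_patches=196, dim=128, mask_ratio=0.5):
            super().__init__()
            self.mask_ratio = mask_ratio
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            # Crude stand-in for CAE's latent regressor: predict hidden-token
            # features from the pooled visible context.
            self.predictor = nn.Linear(dim, dim)
            self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))

        def forward(self, tokens):  # tokens: (B, N, dim) patch embeddings
            B, N, D = tokens.shape
            n_keep = int(N * (1 - self.mask_ratio))
            perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
            keep, hide = perm[:, :n_keep], perm[:, n_keep:]
            x = tokens + self.pos
            visible = torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
            context = self.encoder(visible).mean(dim=1, keepdim=True)
            pred = self.predictor(context).expand(-1, hide.shape[1], -1)
            target = torch.gather(tokens, 1, hide.unsqueeze(-1).expand(-1, -1, D))
            return nn.functional.mse_loss(pred, target)

    loss = TinyMIM()(torch.randn(2, 196, 128))  # one pretraining step's loss
    loss.backward()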

Hiera: A hierarchical vision transformer without the bells-and-whistles

C Ryali, YT Hu, D Bolya, C Wei, H Fan… - International …, 2023 - proceedings.mlr.press
Modern hierarchical vision transformers have added several vision-specific components in
the pursuit of supervised classification performance. While these components lead to …

What to hide from your students: Attention-guided masked image modeling

I Kakogeorgiou, S Gidaris, B Psomas, Y Avrithis… - … on Computer Vision, 2022 - Springer
Transformers and masked language modeling are quickly being adopted and explored in
computer vision as vision transformers and masked image modeling (MIM). In this work, we …
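
(A few lines suffice to sketch attention-guided masking as the snippet describes it: a teacher's CLS-to-patch attention scores decide which patches to hide. Hiding the top-attended patches is an assumption for illustration; all names are hypothetical.)

    # Attention-guided masking sketch (assumption for illustration: hide the
    # patches a teacher's CLS token attends to most, i.e. the informative ones).
    import torch

    def attention_guided_mask(cls_attn, mask_ratio=0.4):
        """cls_attn: (B, N) attention from the CLS token to N patch tokens."""
        B, N = cls_attn.shape
        n_hide = int(N * mask_ratio)
        hide_idx = cls_attn.topk(n_hide, dim=1).indices  # most-attended patches
        mask = torch.zeros(B, N, dtype=torch.bool)
        mask.scatter_(1, hide_idx, True)  # True = patch is masked
        return mask

    mask = attention_guided_mask(torch.rand(2, 196))
    print(mask.float().mean())  # fraction masked, approximately mask_ratio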

Masked image modeling with local multi-scale reconstruction

H Wang, Y Tang, Y Wang, J Guo… - Proceedings of the …, 2023 - openaccess.thecvf.com
Masked Image Modeling (MIM) achieves outstanding success in self-supervised
representation learning. Unfortunately, MIM models typically have huge computational …

A survey on masked autoencoder for self-supervised learning in vision and beyond

C Zhang, C Zhang, J Song, JSK Yi, K Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
Masked autoencoders are scalable vision learners, as the title of MAE (He et al., 2022) states,
which suggests that self-supervised learning (SSL) in vision might follow a similar …

Masked modeling for self-supervised representation learning on vision and beyond

S Li, L Zhang, Z Wang, D Wu, L Wu, Z Liu, J Xia… - arXiv preprint arXiv …, 2023 - arxiv.org
As the deep learning revolution marches on, self-supervised learning has garnered
increasing attention in recent years thanks to its remarkable representation learning ability …

Torchsparse++: Efficient training and inference framework for sparse convolution on gpus

H Tang, S Yang, Z Liu, K Hong, Z Yu, X Li… - Proceedings of the 56th …, 2023 - dl.acm.org
Sparse convolution plays a pivotal role in emerging workloads, including point cloud
processing in AR/VR, autonomous driving, and graph understanding in recommendation …
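
(Conceptually, sparse convolution computes outputs only at occupied voxels and finds neighbors via a coordinate hash map. The toy sketch below shows that idea in plain PyTorch; it is not the TorchSparse++ API.)

    # Toy sparse 3-D convolution in plain PyTorch (conceptual; not the
    # TorchSparse++ API): outputs exist only at occupied voxels, and each
    # kernel offset contributes only where the neighboring voxel is occupied.
    import torch

    def sparse_conv3d(coords, feats, weight):
        """coords: (N, 3) integer voxel coords; feats: (N, C_in);
        weight: (27, C_in, C_out), one matrix per 3x3x3 kernel offset."""
        table = {tuple(c.tolist()): i for i, c in enumerate(coords)}
        offsets = torch.stack(torch.meshgrid(
            *[torch.arange(-1, 2)] * 3, indexing="ij"), dim=-1).reshape(-1, 3)
        out = feats.new_zeros(coords.shape[0], weight.shape[-1])
        for k, off in enumerate(offsets):
            for i, c in enumerate(coords):
                j = table.get(tuple((c + off).tolist()))
                if j is not None:  # skip empty space entirely
                    out[i] += feats[j] @ weight[k]
        return out

    coords = torch.randint(0, 8, (50, 3))
    out = sparse_conv3d(coords, torch.randn(50, 16), torch.randn(27, 16, 32))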

Stitchable neural networks

Z Pan, J Cai, B Zhuang - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
The public model zoo, containing numerous powerful pretrained model families (e.g.,
ResNet/DeiT), has reached an unprecedented scale, which significantly …
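
(The stitching idea can be sketched as a learned projection bridging the feature spaces of two pretrained networks; the modules and dimensions below are illustrative assumptions, not the paper's code.)

    # Illustrative "stitching layer": a learned linear projection that routes
    # activations from the front of one pretrained model into the back of
    # another (modules and dimensions are assumptions, not SN-Net code).
    import torch
    import torch.nn as nn

    front = nn.Sequential(nn.Linear(64, 64), nn.GELU())    # stand-in: model A blocks
    back = nn.Sequential(nn.Linear(128, 128), nn.GELU())   # stand-in: model B blocks
    stitch = nn.Linear(64, 128)                            # aligns feature spaces

    x = torch.randn(2, 16, 64)                             # (B, tokens, dim_A)
    y = back(stitch(front(x)))                             # A-front -> stitch -> B-back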