Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early …
Pre-training on large amounts of image data has become the de facto approach for learning robust 2D representations. In contrast, owing to expensive data collection and processing, the paucity of 3D datasets severely hinders …
Q Chen, X Chen, J Wang, S Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Detection transformer (DETR) relies on one-to-one assignment, assigning one ground-truth object to one prediction, for end-to-end detection without NMS post-processing. It is known …
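The one-to-one assignment described above is typically computed as a bipartite matching between predictions and ground-truth objects. A minimal sketch, using Hungarian matching via `scipy.optimize.linear_sum_assignment` and an illustrative cost (negative class score plus an L1 box distance) standing in for DETR's full matching cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_to_one_assign(pred_scores, pred_boxes, gt_labels, gt_boxes):
    """Assign each ground-truth object to exactly one prediction by
    minimizing a matching cost (sketch; the cost terms here are
    simplified stand-ins for DETR's classification + box costs).

    pred_scores: (num_preds, num_classes) class probabilities
    pred_boxes:  (num_preds, 4) predicted boxes
    gt_labels:   (num_gts,) ground-truth class indices
    gt_boxes:    (num_gts, 4) ground-truth boxes
    """
    # cost[i, j]: cost of matching prediction i to ground truth j
    cls_cost = -pred_scores[:, gt_labels]                             # (num_preds, num_gts)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = cls_cost + box_cost
    # Hungarian matching: each ground truth gets one distinct prediction
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx
```

Because every ground truth is matched to a single, distinct prediction, duplicates receive no supervision and NMS post-processing becomes unnecessary.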
Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations. In typical approaches, models usually …
Masked Autoencoder (MAE) has demonstrated superior performance on various vision tasks by randomly masking image patches and reconstructing them. However, effective data …
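The random masking step described above can be sketched as follows; this is a minimal illustration (not the authors' implementation), assuming the image has already been split into flattened patch tokens:

```python
import numpy as np

def random_mask_patches(patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking (sketch): keep a random subset of patch
    tokens. The encoder would see only the kept patches; the decoder
    reconstructs the masked ones.

    patches: (num_patches, patch_dim) flattened patch tokens
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    noise = rng.random(n)
    ids_shuffle = np.argsort(noise)     # random permutation of patch ids
    ids_keep = ids_shuffle[:n_keep]     # visible patches fed to the encoder
    ids_mask = ids_shuffle[n_keep:]     # patches to reconstruct
    mask = np.zeros(n, dtype=bool)
    mask[ids_mask] = True               # True = masked (reconstruction target)
    return patches[ids_keep], mask, ids_keep, ids_mask
```

With the default 75% ratio, only a quarter of the patches reach the encoder, which is what makes MAE pre-training computationally cheap relative to processing the full token sequence.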
Y Liu, S Zhang, J Chen, Z Yu… - Proceedings of the …, 2023 - openaccess.thecvf.com
There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target …
"Masked autoencoders are scalable vision learners", as the title of MAE \cite{he2022masked} states, suggests that self-supervised learning (SSL) in vision might undertake a similar …
As the deep learning revolution marches on, self-supervised learning has garnered increasing attention in recent years thanks to its remarkable representation learning ability …
K Cai, P Ren, Y Zhu, H Xu, J Liu, C Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently, semantic segmentation models trained with image-level text supervision have shown promising results in challenging open-world scenarios. However, these models still …