Cut and learn for unsupervised object detection and instance segmentation

X Wang, R Girdhar, SX Yu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We propose Cut-and-LEaRn (CutLER), a simple approach for training
unsupervised object detection and segmentation models. We leverage the property of self …

Scaling vision transformers to gigapixel images via hierarchical self-supervised learning

RJ Chen, C Chen, Y Li, TY Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision Transformers (ViTs) and their multi-scale and hierarchical variations have
been successful at capturing image representations but their use has been generally …

Vision transformers need registers

T Darcet, M Oquab, J Mairal, P Bojanowski - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers have recently emerged as a powerful tool for learning visual representations.
In this paper, we identify and characterize artifacts in feature maps of both supervised and …
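
The remedy the title alludes to, appending a few learnable "register" tokens to the ViT token sequence and discarding them at the output, can be sketched in a few lines. The toy encoder below only illustrates that mechanism under assumed dimensions (ViT-S-like, 196 patches); it is not the authors' model or training setup.

import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Toy encoder with register tokens: extra learnable tokens are appended to
    the patch sequence, attended to like any other token, and discarded at the
    output (illustration only, not the paper's architecture)."""

    def __init__(self, num_patches=196, dim=384, depth=4, heads=6, num_registers=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):                     # (B, N, dim)
        b = patch_tokens.shape[0]
        x = torch.cat([self.cls_token.expand(b, -1, -1), patch_tokens], dim=1)
        x = x + self.pos_embed
        # Registers get no positional embedding; like [CLS], they are position-free.
        x = torch.cat([x, self.registers.expand(b, -1, -1)], dim=1)
        x = self.blocks(x)
        x = x[:, : -self.num_registers]                  # drop registers at the output
        return x[:, 0], x[:, 1:]                         # CLS token, patch tokens

cls_out, patch_out = ViTWithRegisters()(torch.randn(2, 196, 384))
print(cls_out.shape, patch_out.shape)                    # (2, 384), (2, 196, 384)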

Transformer-based visual segmentation: A survey

X Li, H Ding, H Yuan, W Zhang, J Pang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Visual segmentation seeks to partition images, video frames, or point clouds into multiple
segments or groups. This technique has numerous real-world applications, such as …

Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization

L Melas-Kyriazi, C Rupprecht… - Proceedings of the …, 2022 - openaccess.thecvf.com
Unsupervised localization and segmentation are long-standing computer vision challenges
that involve decomposing an image into semantically-meaningful segments without any …
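
A minimal sketch of the general idea behind such deep spectral approaches, not the authors' exact pipeline: build an affinity matrix from dense self-supervised ViT patch features, form the normalized graph Laplacian, and read soft segments off its low-frequency eigenvectors (thresholding the Fiedler vector gives a foreground/background split). The feature array and patch-grid shape below are placeholders.

import numpy as np

def spectral_segments(feats, grid_hw, n_vecs=4):
    """Soft segmentation masks from dense patch features via spectral decomposition.

    feats:   (N, D) array of per-patch features (e.g., from a self-supervised ViT)
    grid_hw: (H, W) patch-grid shape with H * W == N
    Returns n_vecs eigenvector "masks" of shape (n_vecs, H, W).
    """
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    affinity = np.clip(f @ f.T, 0, None)      # non-negative cosine affinities
    deg = affinity.sum(axis=1)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(deg + 1e-8)
    lap = np.eye(len(f)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)    # eigenvalues in ascending order
    # Skip the trivial near-constant eigenvector; the next ones are smooth partitions.
    h, w = grid_hw
    return eigvecs[:, 1:1 + n_vecs].T.reshape(n_vecs, h, w)

feats = np.random.randn(14 * 14, 384).astype(np.float32)  # stand-in for real features
masks = spectral_segments(feats, (14, 14))
print(masks.shape)                                         # (4, 14, 14)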

Neural feature fusion fields: 3d distillation of self-supervised 2d image representations

V Tschernezki, I Laina, D Larlus… - … Conference on 3D …, 2022 - ieeexplore.ieee.org
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image
feature extractors when the latter are applied to the analysis of multiple images …

Deep ViT features as dense visual descriptors

S Amir, Y Gandelsman, S Bagon… - arXiv preprint arXiv …, 2021 - dino-vit-features.github.io
We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as
dense visual descriptors. We observe and empirically demonstrate that such features, when …
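
As an illustration of extracting such dense descriptors, the sketch below pulls patch tokens from a self-supervised DINO ViT-S/16 and reshapes them into a feature map. It assumes the public facebookresearch/dino torch.hub entry point and its get_intermediate_layers() helper; the paper itself studies several facets of the features (e.g., keys from intermediate layers), whereas this simply takes the final-layer patch tokens.

import torch

# Assumption: the facebookresearch/dino hub entry point; weights download on first use.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

img = torch.randn(1, 3, 224, 224)            # stand-in for a normalized RGB image
with torch.no_grad():
    tokens = model.get_intermediate_layers(img, n=1)[0]  # (1, 1 + N, 384)

patch_tokens = tokens[:, 1:, :]              # drop the [CLS] token
h = w = 224 // 16                            # 14x14 patch grid for ViT-S/16
dense_desc = patch_tokens.reshape(1, h, w, -1).permute(0, 3, 1, 2)  # (1, 384, 14, 14)
print(dense_desc.shape)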

Bridging the gap to real-world object-centric learning

M Seitzer, M Horn, A Zadaianchuk, D Zietlow… - arXiv preprint arXiv …, 2022 - arxiv.org
Humans naturally decompose their environment into entities at the appropriate level of
abstraction to act in the world. Allowing machine learning algorithms to derive this …

FreeSOLO: Learning to segment objects without annotations

X Wang, Z Yu, S De Mello, J Kautz… - Proceedings of the …, 2022 - openaccess.thecvf.com
Instance segmentation is a fundamental vision task that aims to recognize and segment
each object in an image. However, it requires costly annotations such as bounding boxes …

Exploiting unlabeled data with vision and language models for object detection

S Zhao, Z Zhang, S Schulter, L Zhao… - European conference on …, 2022 - Springer
Building robust and generic object detection frameworks requires scaling to larger label
spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations …