CLIP as RNN: Segment countless visual concepts without training endeavor

S Sun, R Li, P Torr, X Gu, S Li - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask
labels and/or image-text datasets. Mask labels are labor-intensive which limits the number of …

TagAlign: Improving vision-language alignment with multi-tag classification

Q Liu, W Wu, K Zheng, Z Tong, J Liu, Y Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
The crux of learning vision-language models is to extract semantically aligned information
from visual and linguistic data. Existing attempts usually face the problem of coarse …

Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation

JJ Wu, ACH Chang, CY Chuang… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper addresses text-supervised semantic segmentation aiming to learn a model
capable of segmenting arbitrary visual concepts within images by using only image-text …

Open-vocabulary segmentation with unpaired mask-text supervision

Z Wang, X Xia, Z Chen, X He, Y Guo, M Gong… - arXiv preprint arXiv …, 2024 - arxiv.org
Contemporary cutting-edge open-vocabulary segmentation approaches commonly rely on
image-mask-text triplets, yet this restricted annotation is labour-intensive and encounters …

In defense of lazy visual grounding for open-vocabulary semantic segmentation

D Kang, M Cho - arXiv preprint arXiv:2408.04961, 2024 - arxiv.org
We present lazy visual grounding, a two-stage approach of unsupervised object mask
discovery followed by object grounding, for open-vocabulary semantic segmentation. Plenty …

AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation

J Ge, L Xie, H Xie, P Li, X Zhang, Y Zhang… - European Conference on …, 2025 - Springer
A serious issue that harms the performance of zero-shot visual recognition is named
objective misalignment, i.e., the learning objective prioritizes improving the recognition …

Generalization Boosted Adapter for Open-Vocabulary Segmentation

W Xu, C Wang, X Feng, R Xu, L Huang… - … on Circuits and …, 2024 - ieeexplore.ieee.org
Vision-language models (VLMs) have demonstrated remarkable open-vocabulary object
recognition capabilities, motivating their adaptation for dense prediction tasks like …

Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation

Z Zhang, T Zhang, Y Zhu, J Liu, X Liang, QX Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
The pre-trained vision-language model, exemplified by CLIP, advances zero-shot semantic
segmentation by aligning visual features with class embeddings through a transformer …

Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation

S Hajimiri, IB Ayed, J Dolz - arXiv preprint arXiv:2404.08181, 2024 - arxiv.org
Despite the significant progress in deep learning for dense visual recognition problems,
such as semantic segmentation, traditional methods are constrained by fixed class sets …

Multi-modal prototypes for open-world semantic segmentation

Y Yang, C Ma, C Ju, F Zhang, J Yao, Y Zhang… - International Journal of …, 2024 - Springer
In semantic segmentation, generalizing a visual system to both seen categories and novel
categories at inference time has always been practically valuable yet challenging. To enable …