Y Zang, W Li, J Han, K Zhou, CC Loy - International Journal of Computer …, 2024 - Springer
Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential …
Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions. However, complex visual scenes can be naturally …
We introduce Florence-2, a novel vision foundation model with a unified prompt-based representation for various computer vision and vision-language tasks. While existing large …
In this work, we present GLEE, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection …
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong …
S Yu, PH Seo, J Son - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this …
Y Yang, C Ma, J Yao, Z Zhong, Y Zhang… - European Conference on …, 2025 - Springer
Referring Image Segmentation (RIS) leveraging transformers has achieved great success on the interpretation of complex visual-language tasks. However, the quadratic …
Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent …