The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems unifying various vision-language (VL) tasks by instruction …
C Liu, X Li, H Ding - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Significant advancements have been made in image editing with the recent advance of the Diffusion model. However, most of the current methods primarily focus on global or subject …
S He, H Ding - Proceedings of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a …
L Ji, Y Du, Y Dang, W Gao, H Zhang - Neurocomputing, 2024 - Elsevier
Referring image segmentation is guided by natural language descriptions to separate the target objects in an image. This task is different from semantic segmentation and instance …
Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling where grounded objects are captured by bounding …
We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object detection ability of multimodal large language models (MLLMs), such as GPT-4V and …
S Dai, J Liu, NM Cheung - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Existing counting tasks are limited to the class level, which don't account for fine-grained details within the class. In real applications, it often requires in-context or referring human …
Y Zang, C Fu, R Cao, D Zhu, M Zhang, W Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Referring expression segmentation (RES), a task that involves localizing specific instance-level objects based on free-form linguistic descriptions, has emerged as a crucial frontier in …
C Xie, Z Zhang, Y Wu, F Zhu, R Zhao… - arXiv preprint arXiv …, 2023 - arxiv.org
Detecting objects based on language descriptions is a popular task that includes Open-Vocabulary object Detection (OVD) and Referring Expression Comprehension (REC). In this …