Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …

Contextual object detection with multimodal large language models

Y Zang, W Li, J Han, K Zhou, CC Loy - International Journal of Computer …, 2024 - Springer
Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language
tasks, such as image captioning and question answering, but lack the essential …

Hierarchical open-vocabulary universal image segmentation

X Wang, S Li, K Kallidromitis, Y Kato… - Advances in …, 2024 - proceedings.neurips.cc
Open-vocabulary image segmentation aims to partition an image into semantic regions
according to arbitrary text descriptions. However, complex visual scenes can be naturally …

Florence-2: Advancing a unified representation for a variety of vision tasks

B Xiao, H Wu, W Xu, X Dai, H Hu, Y Lu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce Florence-2, a novel vision foundation model with a unified prompt-based
representation for various computer vision and vision-language tasks. While existing large …

General object foundation model for images and videos at scale

J Wu, Y Jiang, Q Liu, Z Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present GLEE in this work, an object-level foundation model for locating and identifying
objects in images and videos. Through a unified framework, GLEE accomplishes detection …

PaLI-3 vision language models: Smaller, faster, stronger

X Chen, X Wang, L Beyer, A Kolesnikov, J Wu… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that
compares favorably to similar models that are 10x larger. As part of arriving at this strong …

Zero-shot referring image segmentation with global-local context features

S Yu, PH Seo, J Son - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Referring image segmentation (RIS) aims to find a segmentation mask given a referring
expression grounded to a region of the input image. Collecting labelled datasets for this …

ReMamber: Referring image segmentation with Mamba twister

Y Yang, C Ma, J Yao, Z Zhong, Y Zhang… - European Conference on …, 2025 - Springer
Referring Image Segmentation (RIS) leveraging transformers has achieved great
success on the interpretation of complex visual-language tasks. However, the quadratic …

GSVA: Generalized segmentation via multimodal large language models

Z Xia, D Han, Y Han, X Pan, S Song… - Proceedings of the …, 2024 - openaccess.thecvf.com
Generalized Referring Expression Segmentation (GRES) extends the scope of
classic RES to refer to multiple objects in one expression or identify the empty targets absent …

Segment every reference object in spatial and temporal spaces

J Wu, Y Jiang, B Yan, H Lu… - Proceedings of the …, 2023 - openaccess.thecvf.com
The reference-based object segmentation tasks, namely referring image segmentation
(RIS), referring video object segmentation (RVOS), and video object segmentation (VOS) …