Kosmos-2: Grounding multimodal large language models to the world

Z Peng, W Wang, L Dong, Y Hao, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the …

Ferret: Refer and ground anything anywhere at any granularity

H You, H Zhang, Z Gan, X Du, B Zhang, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …

GLaMM: Pixel grounding large multimodal model

H Rasheed, M Maaz, S Shaji… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) extend Large Language Models to the vision
domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual …

Grounding language models to images for multimodal inputs and outputs

JY Koh, R Salakhutdinov… - … Conference on Machine …, 2023 - proceedings.mlr.press
We propose an efficient method to ground pretrained text-only language models to the
visual domain, enabling them to process arbitrarily interleaved image-and-text data, and …

Language is not all you need: Aligning perception with language models

S Huang, L Dong, W Wang, Y Hao… - Advances in …, 2024 - proceedings.neurips.cc
A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …

BuboGPT: Enabling visual grounding in multi-modal LLMs

Y Zhao, Z Lin, D Zhou, Z Huang, J Feng… - arXiv preprint arXiv …, 2023 - arxiv.org
LLMs have demonstrated remarkable abilities in interacting with humans through language,
especially with the use of instruction-following data. Recent advancements in LLMs, such …

NExT-Chat: An LMM for chat, detection and segmentation

A Zhang, L Zhao, CW Xie, Y Zheng, W Ji… - arXiv preprint arXiv …, 2023 - arxiv.org
The development of large language models (LLMs) has greatly advanced the field of
multimodal understanding, leading to the emergence of large multimodal models (LMMs). In …

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

M Reid, N Savinov, D Teplyashin, D Lepikhin… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly
compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning …

Modeling context in referring expressions

L Yu, P Poirson, S Yang, AC Berg, TL Berg - Computer Vision–ECCV 2016 …, 2016 - Springer
Humans refer to objects in their environments all the time, especially in dialogue with other
people. We explore generating and comprehending natural language referring expressions …