Kosmos-2: Grounding multimodal large language models to the world

Z Peng, W Wang, L Dong, Y Hao, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the …

Ferret: Refer and ground anything anywhere at any granularity

H You, H Zhang, Z Gan, X Du, B Zhang, Z Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …

GLaMM: Pixel grounding large multimodal model

H Rasheed, M Maaz, S Shaji… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) extend Large Language Models to the vision
domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual …

Grounding language models to images for multimodal inputs and outputs

JY Koh, R Salakhutdinov… - … Conference on Machine …, 2023 - proceedings.mlr.press
We propose an efficient method to ground pretrained text-only language models to the
visual domain, enabling them to process arbitrarily interleaved image-and-text data, and …

Language is not all you need: Aligning perception with language models

S Huang, L Dong, W Wang, Y Hao… - Advances in …, 2024 - proceedings.neurips.cc
A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …

BuboGPT: Enabling visual grounding in multi-modal LLMs

Y Zhao, Z Lin, D Zhou, Z Huang, J Feng… - arXiv preprint arXiv …, 2023 - arxiv.org
LLMs have demonstrated remarkable abilities in interacting with humans through language,
especially with the use of instruction-following data. Recent advancements in LLMs, such …

NExT-Chat: An LMM for chat, detection and segmentation

A Zhang, L Zhao, CW Xie, Y Zheng, W Ji… - arXiv preprint arXiv …, 2023 - arxiv.org
The development of large language models (LLMs) has greatly advanced the field of
multimodal understanding, leading to the emergence of large multimodal models (LMMs). In …

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

M Reid, N Savinov, D Teplyashin, D Lepikhin… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly
compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning …

Modeling context in referring expressions

L Yu, P Poirson, S Yang, AC Berg, TL Berg - Computer Vision–ECCV 2016 …, 2016 - Springer
Humans refer to objects in their environments all the time, especially in dialogue with other
people. We explore generating and comprehending natural language referring expressions …