Grounding Natural Language Instructions: Can Large Language Models Capture Spatial Information?

LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models

L Lian, B Li, A Yala, T Darrell - arXiv preprint arXiv:2305.13655, 2023 - arxiv.org

Recent advancements in text-to-image diffusion models have yielded impressive results in
generating realistic and diverse images. However, these models still struggle with complex …

Cited by 141 Related articles All 3 versions

Thinking in space: How multimodal large language models see, remember, and recall spaces

J Yang, S Yang, AW Gupta, R Han, L Fei-Fei… - arXiv preprint arXiv …, 2024 - arxiv.org

Humans possess the visual-spatial intelligence to remember spaces from sequential visual
observations. However, can Multimodal Large Language Models (MLLMs) trained on million …

Cited by 2 Related articles All 2 versions

Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning

Y Tang, A Qu, Z Wang, D Zhuang, Z Wu, W Ma… - arXiv preprint arXiv …, 2024 - arxiv.org

Vision language models (VLMs) have demonstrated impressive performance across a wide
range of downstream tasks. However, their proficiency in spatial reasoning remains limited …

Cited by 2 Related articles All 2 versions

Visual grounding for desktop graphical user interfaces

T Dardouri, L Minkova, JL Espejel, W Dahhane… - arXiv preprint arXiv …, 2024 - arxiv.org

Most instance perception and image understanding solutions focus mainly on natural
images. However, applications for synthetic images, and more specifically, images of …

Cited by 2 Related articles All 2 versions

Referring to screen texts with voice assistants

S Bhargava, A Dhoot, IM Jonsson, HL Nguyen… - arXiv preprint arXiv …, 2023 - arxiv.org

Voice assistants help users make phone calls, send messages, create events, navigate, and
do a lot more. However, assistants have limited capacity to understand their users' context …

Cited by 3 Related articles All 5 versions

TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

C Li, C Zhang, H Zhou, N Collier, A Korhonen… - arXiv preprint arXiv …, 2024 - arxiv.org

Top-view perspective denotes a typical way in which humans read and reason over different
types of maps, and it is vital for localization and navigation of humans as well as of non …

Cited by 5 Related articles All 2 versions

A Pipeline of Neural-Symbolic Integration to Enhance Spatial Reasoning in Large Language Models

R Wang, K Sun, J Kuhn - arXiv preprint arXiv:2411.18564, 2024 - arxiv.org

Large Language Models (LLMs) have demonstrated impressive capabilities across various
tasks. However, LLMs often struggle with spatial reasoning which is one essential part of …
