LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models

L Lian, B Li, A Yala, T Darrell - arXiv preprint arXiv:2305.13655, 2023 - arxiv.org
Recent advancements in text-to-image diffusion models have yielded impressive results in
generating realistic and diverse images. However, these models still struggle with complex …

Thinking in space: How multimodal large language models see, remember, and recall spaces

J Yang, S Yang, AW Gupta, R Han, L Fei-Fei… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans possess the visual-spatial intelligence to remember spaces from sequential visual
observations. However, can Multimodal Large Language Models (MLLMs) trained on million …

Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to composite spatial reasoning

Y Tang, A Qu, Z Wang, D Zhuang, Z Wu, W Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have demonstrated impressive performance across a wide
range of downstream tasks. However, their proficiency in spatial reasoning remains limited …

Visual grounding for desktop graphical user interfaces

T Dardouri, L Minkova, JL Espejel, W Dahhane… - arXiv preprint arXiv …, 2024 - arxiv.org
Most instance perception and image understanding solutions focus mainly on natural
images. However, applications for synthetic images, and more specifically, images of …

Referring to screen texts with voice assistants

S Bhargava, A Dhoot, IM Jonsson, HL Nguyen… - arXiv preprint arXiv …, 2023 - arxiv.org
Voice assistants help users make phone calls, send messages, create events, navigate, and
do a lot more. However, assistants have limited capacity to understand their users' context …

TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

C Li, C Zhang, H Zhou, N Collier, A Korhonen… - arXiv preprint arXiv …, 2024 - arxiv.org
Top-view perspective denotes a typical way in which humans read and reason over different
types of maps, and it is vital for localization and navigation of humans as well as of non …

A Pipeline of Neural-Symbolic Integration to Enhance Spatial Reasoning in Large Language Models

R Wang, K Sun, J Kuhn - arXiv preprint arXiv:2411.18564, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated impressive capabilities across various
tasks. However, LLMs often struggle with spatial reasoning which is one essential part of …