Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million …
Vision language models (VLMs) have demonstrated impressive performance across a wide range of downstream tasks. However, their proficiency in spatial reasoning remains limited …
T Dardouri, L Minkova, JL Espejel, W Dahhane… - arXiv preprint arXiv …, 2024 - arxiv.org
Most instance perception and image understanding solutions focus mainly on natural images. However, applications for synthetic images, and more specifically, images of …
S Bhargava, A Dhoot, IM Jonsson, HL Nguyen… - arXiv preprint arXiv …, 2023 - arxiv.org
Voice assistants help users make phone calls, send messages, create events, navigate, and do a lot more. However, assistants have limited capacity to understand their users' context …
Top-view perspective denotes a typical way in which humans read and reason over different types of maps, and it is vital for localization and navigation of humans as well as of non …
R Wang, K Sun, J Kuhn - arXiv preprint arXiv:2411.18564, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated impressive capabilities across various tasks. However, LLMs often struggle with spatial reasoning, which is one essential part of …