Articles added in the last year, sorted by date

Multimodal heterogeneous graph entity-level fusion for named entity recognition with multi-granularity visual guidance

Y Gong, X Lv, Z Yuan, ZJ Wang, F Hu, X You - The Journal of …, 2024 - Springer
5 days ago - … Multimodal named entity recognition (MNER) is an emerging foundational task
in natural language processing. However, existing methods have two main limitations: 1) …

Visual Text Generation in the Wild

Y Zhu, J Liu, F Gao, W Liu, X Wang, P Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
9 days ago - … In TRCG, we leverage the visual reasoning ability of Multimodal Large Language
Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE …

VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

S Moon, H Woo, H Park, H Jung, R Mahjourian… - arXiv preprint arXiv …, 2024 - arxiv.org
10 days ago - … by a Vision-Language Model (VLM) and refined by a Large Language Model (LLM)
as … the potential of visual semantics, we propose VisionTrap, a vision-augmented trajectory …

Scene text recognition: an Indic perspective

VP Vijayan, S Chanda, D Doermann… - … Analysis and Recognition …, 2024 - Springer
12 days ago - … Exploring Scene Text Recognition (STR) in Indian languages … visual features and
language knowledge for word … the Indian language Tamil, Malayalam, and Telugu scene text

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

R Li, Z Zhang, C He, Z Ma, VM Patel… - arXiv preprint arXiv …, 2024 - arxiv.org
15 days ago - … coarse view- or region-level text prompts, we leverage large vision-language models
to extract complete category information and scalable scene descriptions to build the text

OVExp: Open Vocabulary Exploration for Object-Oriented Navigation

M Wei, T Wang, Y Chen, H Wang, J Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
16 days ago - … Specifically, leveraging CLIP’s joint visual-language space, … into language
feature-based maps through CLIP’s text … between the language-based map and the vision-based …

LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition

T Wang, L Meng, L Cheng, C Sun - arXiv preprint arXiv:2407.06730, 2024 - arxiv.org
18 days ago - … field of VPR for generation high-level language descriptions of the visual scene,
and attempt to build a discriminative global representation by fusing visual and text features. …

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

P Jiao, N Zhao, J Chen, YG Jiang - arXiv preprint arXiv:2407.05256, 2024 - arxiv.org
21 days ago - … Additionally, to align the 3D space with the powerful vision-language space, we …
the vision-language feature space using a pre-trained VLM at the instance, category, and scene

9 Interpretation of Deep

SU LIME, S Bayrak - Explainable, Interpretable, and Transparent …, 2024 - books.google.com
23 days ago - … such as visual perception, speech recognition, decision-making, and language
translation… signal processing, natural language processing (NLP), computer vision, and robotics. …

Holistic scene understanding through image and video scene graphs

Y Cong - 2024 - repo.uni-hannover.de
35 days ago - scene understanding, as well as a promising tool to bridge the domains of vision and
language. … lacks a comprehensive, systematic analysis of scene graphs and their practical …