scene text recognition vision language - Academic Resource Search

Articles added in the past year, sorted by date

Multimodal heterogeneous graph entity-level fusion for named entity recognition with multi-granularity visual guidance

Y Gong, X Lv, Z Yuan, ZJ Wang, F Hu, X You - The Journal of …, 2024 - Springer

5 days ago - … Multimodal named entity recognition (MNER) is an emerging foundational task
in natural language processing. However, existing methods have two main limitations: 1) …

Visual Text Generation in the Wild

Y Zhu, J Liu, F Gao, W Liu, X Wang, P Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

9 days ago - … In TRCG, we leverage the visual reasoning ability of Multimodal Large Language …
Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE …

VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions

S Moon, H Woo, H Park, H Jung, R Mahjourian… - arXiv preprint arXiv …, 2024 - arxiv.org

10 days ago - … by a Vision-Language Model (VLM) and refined by a Large Language Model (LLM)
as … the potential of visual semantics, we propose VisionTrap, a vision-augmented trajectory …

Scene text recognition: an Indic perspective

VP Vijayan, S Chanda, D Doermann… - … Analysis and Recognition …, 2024 - Springer

12 days ago - … Exploring Scene Text Recognition (STR) in Indian languages … visual features and
language knowledge for word … the Indian language Tamil, Malayalam, and Telugu scene text …

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

R Li, Z Zhang, C He, Z Ma, VM Patel… - arXiv preprint arXiv …, 2024 - arxiv.org

15 days ago - … coarse view- or region-level text prompts, we leverage large vision-language models
to extract complete category information and scalable scene descriptions to build the text …

OVExp: Open Vocabulary Exploration for Object-Oriented Navigation

M Wei, T Wang, Y Chen, H Wang, J Pang… - arXiv preprint arXiv …, 2024 - arxiv.org

16 days ago - … Specifically, leveraging CLIP’s joint visual-language space, … into language
feature-based maps through CLIP’s text … between the language-based map and the vision-based …

LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition

T Wang, L Meng, L Cheng, C Sun - arXiv preprint arXiv:2407.06730, 2024 - arxiv.org

18 days ago - … field of VPR for generation high-level language descriptions of the visual scene,
and attempt to build a discriminative global representation by fusing visual and text features. …

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

P Jiao, N Zhao, J Chen, YG Jiang - arXiv preprint arXiv:2407.05256, 2024 - arxiv.org

21 days ago - … Additionally, to align the 3D space with the powerful vision-language space, we …
the vision-language feature space using a pre-trained VLM at the instance, category, and scene …

9 Interpretation of Deep

SU LIME, S Bayrak - Explainable, Interpretable, and Transparent …, 2024 - books.google.com

23 days ago - … such as visual perception, speech recognition, decision-making, and language
translation… signal processing, natural language processing (NLP), computer vision, and robotics. …

Holistic scene understanding through image and video scene graphs

Y Cong - 2024 - repo.uni-hannover.de

35 days ago - … scene understanding, as well as a promising tool to bridge the domains of vision and
language. … lacks a comprehensive, systematic analysis of scene graphs and their practical …
