OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction

Q Zhang, VSJ Huang, B Wang, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

Document parsing is essential for converting unstructured and semi-structured documents,
such as contracts, academic papers, and invoices, into structured, machine-readable data …

Cited by: 1

TabPedia: Towards comprehensive visual table understanding with concept synergy

W Zhao, H Feng, Q Liu, J Tang, S Wei, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org

Tables contain factual and quantitative data accompanied by various structures and
contents that pose challenges for machine comprehension. Previous methods generally …

Cited by: 8

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

J Tang, C Lin, Z Zhao, S Wei, B Wu, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

Text-centric visual question answering (VQA) has made great strides with the development
of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of …

Cited by: 16

MindBench: A comprehensive benchmark for mind map structure recognition and analysis

L Chen, F Yan, Y Zhong, S Chen, Z Jie, L Ma - arXiv preprint arXiv …, 2024 - arxiv.org

Multimodal Large Language Models (MLLM) have made significant progress in the field of
document analysis. Despite this, existing benchmarks typically focus only on extracting text …

Cited by: 2

DocLayLLM: An efficient and effective multi-modal extension of large language models for text-rich document understanding

W Liao, J Wang, H Li, C Wang, J Huang… - arXiv preprint arXiv …, 2024 - arxiv.org

Text-rich document understanding (TDU) refers to analyzing and comprehending
documents containing substantial textual content. With the rapid evolution of large language …

Cited by: 1

VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

H Zhong, Z Yang, Z Li, P Wang, J Tang… - Proceedings of the …, 2024 - dl.acm.org

Text recognition is an inherent integration of vision and language, encompassing the visual
texture in stroke patterns and the semantic context among the character sequences …

InstructOCR: Instruction Boosting Scene Text Spotting

C Duan, Q Jiang, P Fu, J Chen, S Li, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

In the field of scene text spotting, previous OCR methods primarily relied on image encoders
and pre-trained text information, but they often overlooked the advantages of incorporating …

DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness

A Mohammadshirazi, PPG Neogi, SN Lim… - arXiv preprint arXiv …, 2024 - arxiv.org

Document Visual Question Answering (VQA) requires models to interpret textual information
within complex visual layouts and comprehend spatial relationships to answer questions …

Efficient title text detection using multi-loss

S Prasad, A Abraham - International Journal on Document Analysis and …, 2024 - Springer

YouTube's “Video Chapter” feature segments videos into different sections, marked
by timestamps on the slider, enhancing user navigation. Given the vast volume of video …

Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner

Y Zhou, M Cheng, Q Mao, Q Liu, F Xu, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org

Pre-trained foundation models have recently significantly progressed in structured table
understanding and reasoning. However, despite advancements in areas such as table …
