Document parsing unveiled: Techniques, challenges, and prospects for structured information extraction

Q Zhang, VSJ Huang, B Wang, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Document parsing is essential for converting unstructured and semi-structured documents
(such as contracts, academic papers, and invoices) into structured, machine-readable data …

TabPedia: Towards comprehensive visual table understanding with concept synergy

W Zhao, H Feng, Q Liu, J Tang, S Wei, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Tables contain factual and quantitative data accompanied by various structures and
contents that pose challenges for machine comprehension. Previous methods generally …

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

J Tang, C Lin, Z Zhao, S Wei, B Wu, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-centric visual question answering (VQA) has made great strides with the development
of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of …

MindBench: A comprehensive benchmark for mind map structure recognition and analysis

L Chen, F Yan, Y Zhong, S Chen, Z Jie, L Ma - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLM) have made significant progress in the field of
document analysis. Despite this, existing benchmarks typically focus only on extracting text …

DocLayLLM: An efficient and effective multi-modal extension of large language models for text-rich document understanding

W Liao, J Wang, H Li, C Wang, J Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-rich document understanding (TDU) refers to analyzing and comprehending
documents containing substantial textual content. With the rapid evolution of large language …

VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

H Zhong, Z Yang, Z Li, P Wang, J Tang… - Proceedings of the …, 2024 - dl.acm.org
Text recognition is an inherent integration of vision and language, encompassing the visual
texture in stroke patterns and the semantic context among the character sequences …

InstructOCR: Instruction Boosting Scene Text Spotting

C Duan, Q Jiang, P Fu, J Chen, S Li, Z Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the field of scene text spotting, previous OCR methods primarily relied on image encoders
and pre-trained text information, but they often overlooked the advantages of incorporating …

DLaVA: Document Language and Vision Assistant for Answer Localization with Enhanced Interpretability and Trustworthiness

A Mohammadshirazi, PPG Neogi, SN Lim… - arXiv preprint arXiv …, 2024 - arxiv.org
Document Visual Question Answering (VQA) requires models to interpret textual information
within complex visual layouts and comprehend spatial relationships to answer questions …

Efficient title text detection using multi-loss

S Prasad, A Abraham - International Journal on Document Analysis and …, 2024 - Springer
YouTube's “Video Chapter” feature segments videos into different sections, marked
by timestamps on the slider, enhancing user navigation. Given the vast volume of video …

Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner

Y Zhou, M Cheng, Q Mao, Q Liu, F Xu, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Pre-trained foundation models have recently significantly progressed in structured table
understanding and reasoning. However, despite advancements in areas such as table …