Document understanding dataset and evaluation (DUDE)

J Van Landeghem, R Tito… - Proceedings of the …, 2023 - openaccess.thecvf.com
We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …

DUE: End-to-end document understanding benchmark

Ł Borchmann, M Pietruszka, T Stanislawek… - Thirty-fifth Conference …, 2021 - openreview.net
Understanding documents with rich layouts plays a vital role in digitization and hyper-
automation but remains a challenging topic in the NLP research community. Additionally, the …

OCR-free document understanding transformer

G Kim, T Hong, M Yim, JY Nam, J Park, J Yim… - … on Computer Vision, 2022 - Springer
Understanding document images (e.g., invoices) is a core but challenging task since it
requires complex functions such as reading text and a holistic understanding of the …

mPLUG-DocOwl 1.5: Unified structure learning for OCR-free document understanding

A Hu, H Xu, J Ye, M Yan, L Zhang, B Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Structure information is critical for understanding the semantics of text-rich images, such as
documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for …

XYLayoutLM: Towards layout-aware multimodal networks for visually-rich document understanding

Z Gu, C Meng, K Wang, J Lan… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recently, various multimodal networks for Visually-Rich Document Understanding (VRDU)
have been proposed, showing the promotion of transformers by integrating visual and layout …

DocFormer: End-to-end transformer for document understanding

S Appalaraju, B Jasani, BU Kota… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present DocFormer, a multi-modal transformer-based architecture for the task of Visual
Document Understanding (VDU). VDU is a challenging problem which aims to understand …

Unifying vision, text, and layout for universal document processing

Z Tang, Z Yang, G Wang, Y Fang… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose Universal Document Processing (UDOP), a foundation Document AI
model which unifies text, image, and layout modalities together with varied task formats …

TextMonkey: An OCR-free large multimodal model for understanding document

Y Liu, B Yang, Q Liu, Z Li, Z Ma, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks,
including document question answering (DocVQA) and scene text analysis. Our approach …

Attention where it matters: Rethinking visual document understanding with selective region concentration

H Cao, C Bao, C Liu, H Chen, K Yin… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a novel end-to-end document understanding model called SeRum (SElective
Region Understanding Model) for extracting meaningful information from document images …

LayoutLMv2: Multi-modal pre-training for visually-rich document understanding

Y Xu, Y Xu, T Lv, L Cui, F Wei, G Wang, Y Lu… - arXiv preprint arXiv …, 2020 - arxiv.org
Pre-training of text and layout has proved effective in a variety of visually-rich document
understanding tasks due to its effective model architecture and the advantage of large-scale …