Related articles - Academic resource search

Document understanding dataset and evaluation (dude)

J Van Landeghem, R Tito… - Proceedings of the …, 2023 - openaccess.thecvf.com

We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …

Cited by 25 · Related articles · All 9 versions

[PDF] openreview.net

Due: End-to-end document understanding benchmark

Ł Borchmann, M Pietruszka, T Stanislawek… - Thirty-fifth Conference …, 2021 - openreview.net

Understanding documents with rich layouts plays a vital role in digitization and hyper-automation
but remains a challenging topic in the NLP research community. Additionally, the …

Cited by 48 · Related articles · All 7 versions

[PDF] arxiv.org

Ocr-free document understanding transformer

G Kim, T Hong, M Yim, JY Nam, J Park, J Yim… - … on Computer Vision, 2022 - Springer

Understanding document images (e.g., invoices) is a core but challenging task since it
requires complex functions such as reading text and a holistic understanding of the …

Cited by 238 · Related articles · All 6 versions

[PDF] arxiv.org

mplug-docowl 1.5: Unified structure learning for ocr-free document understanding

A Hu, H Xu, J Ye, M Yan, L Zhang, B Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

Structure information is critical for understanding the semantics of text-rich images, such as
documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for …

Cited by 26 · Related articles · All 2 versions

[PDF] thecvf.com

Xylayoutlm: Towards layout-aware multimodal networks for visually-rich document understanding

Z Gu, C Meng, K Wang, J Lan… - Proceedings of the …, 2022 - openaccess.thecvf.com

Recently, various multimodal networks for Visually-Rich Document Understanding (VRDU)
have been proposed, showing the promotion of transformers by integrating visual and layout …

Cited by 70 · Related articles · All 5 versions

[PDF] thecvf.com

Docformer: End-to-end transformer for document understanding

S Appalaraju, B Jasani, BU Kota… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present DocFormer, a multi-modal transformer-based architecture for the task of Visual
Document Understanding (VDU). VDU is a challenging problem which aims to understand …

Cited by 259 · Related articles · All 6 versions

[PDF] thecvf.com

Unifying vision, text, and layout for universal document processing

Z Tang, Z Yang, G Wang, Y Fang… - Proceedings of the …, 2023 - openaccess.thecvf.com

We propose Universal Document Processing (UDOP), a foundation Document AI
model which unifies text, image, and layout modalities together with varied task formats …

Cited by 73 · Related articles · All 6 versions

[PDF] arxiv.org

Textmonkey: An ocr-free large multimodal model for understanding document

Y Liu, B Yang, Q Liu, Z Li, Z Ma, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks,
including document question answering (DocVQA) and scene text analysis. Our approach …

Cited by 34 · Related articles · All 2 versions

[PDF] thecvf.com

Attention where it matters: Rethinking visual document understanding with selective region concentration

H Cao, C Bao, C Liu, H Chen, K Yin… - Proceedings of the …, 2023 - openaccess.thecvf.com

We propose a novel end-to-end document understanding model called SeRum (SElective
Region Understanding Model) for extracting meaningful information from document images …

Cited by 8 · Related articles · All 5 versions

[PDF] arxiv.org

Layoutlmv2: Multi-modal pre-training for visually-rich document understanding

Y Xu, Y Xu, T Lv, L Cui, F Wei, G Wang, Y Lu… - arXiv preprint arXiv …, 2020 - arxiv.org

Pre-training of text and layout has proved effective in a variety of visually-rich document
understanding tasks due to its effective model architecture and the advantage of large-scale …

Cited by 470 · Related articles · All 7 versions
