DocLLM: A layout-aware generative language model for multimodal document understanding

D Wang, N Raman, M Sibue, Z Ma, P Babkin… - arXiv preprint arXiv …, 2023 - arxiv.org
Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar
records often carry rich semantics at the intersection of textual and spatial modalities. The …

PDFTriage: question answering over long, structured documents

J Saad-Falcon, J Barrow, A Siu, A Nenkova… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) struggle with document question answering (QA) when the
document does not fit in an LLM's limited context window. To …

Instructdoc: A dataset for zero-shot generalization of visual document understanding with instructions

R Tanaka, T Iki, K Nishida, K Saito… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
We study the problem of completing various visual document understanding (VDU) tasks,
e.g., question answering and information extraction, on real-world documents through human …

Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models

H Shao, S Qian, H Xiao, G Song, Z Zong… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of
multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought …

Privacy-aware document visual question answering

R Tito, K Nguyen, M Tobaben, R Kerkouche… - arXiv preprint arXiv …, 2023 - arxiv.org
Document Visual Question Answering (DocVQA) is a fast-growing branch of document
understanding. Despite the fact that documents contain sensitive or copyrighted information …

Bridging the Gap Between End-to-End and Two-Step Text Spotting

M Huang, H Li, Y Liu, X Bai… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Modularity plays a crucial role in the development and maintenance of complex systems.
While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub-…

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

R Wadhawan, H Bansal, KW Chang, N Peng - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in AI have led to the development of large multimodal models (LMMs)
capable of processing complex tasks involving joint reasoning over text and visual content in …

ICDAR 2023 competition on document understanding of everything (DUDE)

J Van Landeghem, R Tito, Ł Borchmann… - … on Document Analysis …, 2023 - Springer
This paper presents the results of the ICDAR 2023 competition on Document UnderstanDing
of Everything. DUDE introduces a new dataset comprising 5K visually-rich documents …

Beyond Document Page Classification: Design, Datasets, and Challenges

J Van Landeghem, S Biswas… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper highlights the need to bring document classification benchmarking closer to real-
world applications, both in the nature of data tested (X: multi-channel, multi-paged, multi …

ANLS*--A Universal Document Processing Metric for Generative Large Language Models

D Peer, P Schöpf, V Nebendahl, A Rietzler… - arXiv preprint arXiv …, 2024 - arxiv.org
Traditionally, discriminative models have been the predominant choice for tasks like
document classification and information extraction. These models make predictions that fall …