DocLLM: A layout-aware generative language model for multimodal document understanding

D Wang, N Raman, M Sibue, Z Ma, P Babkin… - arXiv preprint arXiv …, 2023 - arxiv.org
Enterprise documents such as forms, invoices, receipts, reports, contracts, and other similar
records often carry rich semantics at the intersection of textual and spatial modalities. The …

PDFTriage: question answering over long, structured documents

J Saad-Falcon, J Barrow, A Siu, A Nenkova… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) struggle with document question answering (QA) when the
document does not fit in an LLM's limited context window. To …

Instructdoc: A dataset for zero-shot generalization of visual document understanding with instructions

R Tanaka, T Iki, K Nishida, K Saito… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
We study the problem of completing various visual document understanding (VDU) tasks,
e.g., question answering and information extraction, on real-world documents through human …

Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models

H Shao, S Qian, H Xiao, G Song, Z Zong… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of
multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought …

Privacy-aware document visual question answering

R Tito, K Nguyen, M Tobaben, R Kerkouche… - arXiv preprint arXiv …, 2023 - arxiv.org
Document Visual Question Answering (DocVQA) is a fast-growing branch of document
understanding. Despite the fact that documents contain sensitive or copyrighted information …

Bridging the Gap Between End-to-End and Two-Step Text Spotting

M Huang, H Li, Y Liu, X Bai… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Modularity plays a crucial role in the development and maintenance of complex systems.
While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub-…

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

R Wadhawan, H Bansal, KW Chang, N Peng - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in AI have led to the development of large multimodal models (LMMs)
capable of processing complex tasks involving joint reasoning over text and visual content in …

ICDAR 2023 competition on document understanding of everything (DUDE)

J Van Landeghem, R Tito, Ł Borchmann… - … on Document Analysis …, 2023 - Springer
This paper presents the results of the ICDAR 2023 competition on Document UnderstanDing
of Everything. DUDE introduces a new dataset comprising 5K visually-rich documents …

Beyond Document Page Classification: Design, Datasets, and Challenges

J Van Landeghem, S Biswas… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper highlights the need to bring document classification benchmarking closer to real-
world applications, both in the nature of data tested (X: multi-channel, multi-paged, multi …

ANLS*--A Universal Document Processing Metric for Generative Large Language Models

D Peer, P Schöpf, V Nebendahl, A Rietzler… - arXiv preprint arXiv …, 2024 - arxiv.org
Traditionally, discriminative models have been the predominant choice for tasks like
document classification and information extraction. These models make predictions that fall …