Large Language Models (LLMs) have issues with document question answering (QA) in situations where the document is unable to fit in the small context length of an LLM. To …
We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human …
This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought …
Document Visual Question Answering (DocVQA) is a fast-growing branch of document understanding. Despite the fact that documents contain sensitive or copyrighted information …
M Huang, H Li, Y Liu, X Bai… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Modularity plays a crucial role in the development and maintenance of complex systems. While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub …
Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in …
This paper presents the results of the ICDAR 2023 competition on Document UnderstanDing of Everything. DUDE introduces a new dataset comprising 5K visually-rich documents …
This paper highlights the need to bring document classification benchmarking closer to real-world applications, both in the nature of data tested (X: multi-channel, multi-paged, multi …
D Peer, P Schöpf, V Nebendahl, A Rietzler… - arXiv preprint arXiv …, 2024 - arxiv.org
Traditionally, discriminative models have been the predominant choice for tasks like document classification and information extraction. These models make predictions that fall …