This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images up to 2,560×2,560 resolution …
Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction …
Z Hu, P Yang, Y Jiang, Z Bai - Pattern Recognition, 2024 - Elsevier
Existing studies apply Large Language Models (LLMs) to knowledge-based Visual Question Answering (VQA) with encouraging results. Due to the insufficient input …
Document Visual Question Answering (DocVQA) is a fast-growing branch of document understanding. Despite the fact that documents contain sensitive or copyrighted information …
W Wang, Y Li, Y Ou, Y Zhang - arXiv preprint arXiv:2306.00526, 2023 - arxiv.org
Layout-aware pre-trained models have achieved significant progress on document image question answering. They introduce extra learnable modules into existing language models …
A Sassioui, R Benouini, Y El Ouargui… - … Networks and Mobile …, 2023 - ieeexplore.ieee.org
The increasing prevalence of Visually-rich Documents (VRDs) in diverse domains has led to a growing interest in Visually-rich Document Understanding (VrDU). Researchers have …
Document layout analysis is a well-known problem in the document research community and has been extensively explored, yielding a multitude of solutions ranging from text mining, and …
This paper highlights the need to bring document classification benchmarking closer to real-world applications, both in the nature of data tested (X: multi-channel, multi-paged, multi …
C Chen, L Lin, Y Chen, B Li, J Zeng… - Proceedings of the …, 2024 - openaccess.thecvf.com
The rebroadcasting of screen-recaptured document images introduces a significant risk to the confidential documents processed in government departments and commercial …