Hierarchical multimodal transformers for multipage DocVQA

J Van Landeghem, R Tito… - Proceedings of the …, 2023 - openaccess.thecvf.com

We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …

Cited by 21 - Related articles - All 9 versions

[PDF] arxiv.org

DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding

H Feng, Q Liu, H Liu, W Zhou, H Li, C Huang - arXiv preprint arXiv …, 2023 - arxiv.org

This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free
document understanding, capable of parsing images up to 2,560$\times$2,560 resolution …

Cited by 20 - Related articles - All 2 versions

[PDF] arxiv.org

ScreenAI: A vision-language model for UI and infographics understanding

G Baechler, S Sunkara, M Wang, F Zubach… - arXiv preprint arXiv …, 2024 - arxiv.org

Screen user interfaces (UIs) and infographics, sharing similar visual language and design
principles, play important roles in human communication and human-machine interaction …

Cited by 8 - Related articles - All 4 versions

Prompting large language model with context and pre-answer for knowledge-based VQA

Z Hu, P Yang, Y Jiang, Z Bai - Pattern Recognition, 2024 - Elsevier

Existing studies apply Large Language Model (LLM) to knowledge-based Visual
Question Answering (VQA) with encouraging results. Due to the insufficient input …

Cited by 2 - Related articles - All 2 versions

[PDF] arxiv.org

Privacy-aware document visual question answering

R Tito, K Nguyen, M Tobaben, R Kerkouche… - arXiv preprint arXiv …, 2023 - arxiv.org

Document Visual Question Answering (DocVQA) is a fast growing branch of document
understanding. Despite the fact that documents contain sensitive or copyrighted information …

Cited by 4 - Related articles - All 2 versions

[PDF] arxiv.org

Layout and task aware instruction prompt for zero-shot document image question answering

W Wang, Y Li, Y Ou, Y Zhang - arXiv preprint arXiv:2306.00526, 2023 - arxiv.org

Layout-aware pre-trained models have achieved significant progress on document image
question answering. They introduce extra learnable modules into existing language models …

Cited by 9 - Related articles - All 2 versions

Visually-Rich Document Understanding: Concepts, Taxonomy and Challenges

A Sassioui, R Benouini, Y El Ouargui… - … Networks and Mobile …, 2023 - ieeexplore.ieee.org

The increasing prevalence of Visually-rich Documents (VRDs) in diverse domains has led to
a growing interest in Visually-rich Document Understanding (VrDU). Researchers have …

Cited by 1 - Related articles

[PDF] arxiv.org

SelfDocSeg: A self-supervised vision-based approach towards document segmentation

S Maity, S Biswas, S Manna, A Banerjee… - … on Document Analysis …, 2023 - Springer

Document layout analysis is a known problem to the documents research community and
has been vastly explored yielding a multitude of solutions ranging from text mining, and …

Cited by 5 - Related articles - All 5 versions

[PDF] thecvf.com

Beyond Document Page Classification: Design, Datasets, and Challenges

J Van Landeghem, S Biswas… - Proceedings of the …, 2024 - openaccess.thecvf.com

This paper highlights the need to bring document classification benchmarking closer to real-
world applications, both in the nature of data tested (X: multi-channel, multi-paged, multi …

Cited by 1 - Related articles - All 7 versions

[PDF] thecvf.com

CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images

C Chen, L Lin, Y Chen, B Li, J Zeng… - Proceedings of the …, 2024 - openaccess.thecvf.com

The rebroadcasting of screen-recaptured document images introduces a significant risk to
the confidential documents processed in government departments and commercial …

Cited by 1 - Related articles
