Mini-Gemini: Mining the potential of multi-modality vision language models

Y Li, Y Zhang, C Wang, Z Zhong, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating …

Document understanding dataset and evaluation (DUDE)

J Van Landeghem, R Tito… - Proceedings of the …, 2023 - openaccess.thecvf.com
We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …

Hierarchical multimodal transformers for multipage DocVQA

R Tito, D Karatzas, E Valveny - Pattern Recognition, 2023 - Elsevier
Existing work on DocVQA only considers single-page documents. However, in real applications, documents are mostly composed of multiple pages that should be processed …

SlideVQA: A dataset for document visual question answering on multiple images

R Tanaka, K Nishida, K Nishida, T Hasegawa… - Proceedings of the …, 2023 - ojs.aaai.org
Visual question answering on document images that contain textual, visual, and layout
information, called document VQA, has received much attention recently. Although many …

Towards video text visual question answering: Benchmark and baseline

M Zhao, B Li, J Wang, W Li, W Zhou… - Advances in …, 2022 - proceedings.neurips.cc
In recent years, several text-based visual question answering (TextVQA) benchmarks have been developed to advance a machine's ability to answer questions based on text in images …

A multi-modal neural geometric solver with textual clauses parsed from diagram

ML Zhang, F Yin, CL Liu - arXiv preprint arXiv:2302.11097, 2023 - arxiv.org
Geometry problem solving (GPS) is a high-level mathematical reasoning task requiring the capacities of multi-modal fusion and geometric knowledge application. Recently, neural …

OCR-IDL: OCR annotations for Industry Document Library dataset

AF Biten, R Tito, L Gomez, E Valveny… - European Conference on …, 2022 - Springer
Pretraining has proven successful in Document Intelligence tasks, where a deluge of documents is used to pretrain the models, which are only later fine-tuned on downstream tasks …

CuMo: Scaling multimodal LLM with co-upcycled mixture-of-experts

J Li, X Wang, S Zhu, CW Kuo, L Xu, F Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Multimodal Large Language Models (LLMs) have focused
primarily on scaling by increasing text-image pair data and enhancing LLMs to improve …

Watching the news: Towards VideoQA models that can read

S Jahagirdar, M Mathew, D Karatzas… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video Question Answering methods focus on common-sense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA …

ICDAR 2021 competition on document visual question answering

R Tito, M Mathew, CV Jawahar, E Valveny… - Document Analysis and …, 2021 - Springer
In this report, we present the results of the ICDAR 2021 edition of the Document Visual Question Answering challenges. This edition complements the previous tasks on Single Document VQA and …