Mini-Gemini: Mining the potential of multi-modality vision language models

Y Li, Y Zhang, C Wang, Z Zhong, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating …

Document understanding dataset and evaluation (DUDE)

J Van Landeghem, R Tito… - Proceedings of the …, 2023 - openaccess.thecvf.com
We call on the Document AI (DocAI) community to re-evaluate current methodologies and
embrace the challenge of creating more practically-oriented benchmarks. Document …

Hierarchical multimodal transformers for multipage DocVQA

R Tito, D Karatzas, E Valveny - Pattern Recognition, 2023 - Elsevier
Existing work on DocVQA only considers single-page documents. However, in real applications, documents are mostly composed of multiple pages that should be processed …

SlideVQA: A dataset for document visual question answering on multiple images

R Tanaka, K Nishida, K Nishida, T Hasegawa… - Proceedings of the …, 2023 - ojs.aaai.org
Visual question answering on document images that contain textual, visual, and layout
information, called document VQA, has received much attention recently. Although many …

Towards video text visual question answering: Benchmark and baseline

M Zhao, B Li, J Wang, W Li, W Zhou… - Advances in …, 2022 - proceedings.neurips.cc
In recent years, several text-based visual question answering (TextVQA) benchmarks have been developed to advance a machine's ability to answer questions based on text in images …

A multi-modal neural geometric solver with textual clauses parsed from diagram

ML Zhang, F Yin, CL Liu - arXiv preprint arXiv:2302.11097, 2023 - arxiv.org
Geometry problem solving (GPS) is a high-level mathematical reasoning task requiring the capacities of multi-modal fusion and geometric knowledge application. Recently, neural …

OCR-IDL: OCR annotations for Industry Document Library dataset

AF Biten, R Tito, L Gomez, E Valveny… - European Conference on …, 2022 - Springer
Pretraining has proven successful in Document Intelligence tasks, where a deluge of documents is used to pretrain the models, which are only later fine-tuned on downstream tasks …

CuMo: Scaling multimodal LLM with co-upcycled mixture-of-experts

J Li, X Wang, S Zhu, CW Kuo, L Xu, F Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Multimodal Large Language Models (LLMs) have focused
primarily on scaling by increasing text-image pair data and enhancing LLMs to improve …

Watching the news: Towards VideoQA models that can read

S Jahagirdar, M Mathew, D Karatzas… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video Question Answering methods focus on common-sense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA …

ICDAR 2021 competition on document visual question answering

R Tito, M Mathew, CV Jawahar, E Valveny… - Document Analysis and …, 2021 - Springer
In this report, we present the results of the ICDAR 2021 edition of the Document Visual Question Answering challenges. This edition complements the previous tasks on Single Document VQA and …