PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

Insights from GPT-4 for multimodal large models in multimodal understanding, generation, and interaction

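J Liu, L Guo - Bulletin of National Natural Science Foundation of China, 2023 - nsfc.gov.cn
The conversational chatbot ChatGPT has swept through society with almost overwhelming force,
heralding the dawn of artificial general intelligence. GPT-4, the upgraded version of ChatGPT,
is a multimodal large model; it moves beyond text-only interaction to accept both text and image …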

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Y Ma, Y Zang, L Chen, M Chen, Y Jiao, X Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Understanding documents with rich layouts and multi-modal components is a long-standing
and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable …

Wavelet-Based Image Tokenizer for Vision Transformers

Z Zhu, R Soricut - arXiv preprint arXiv:2405.18616, 2024 - arxiv.org
Non-overlapping patch-wise convolution is the default image tokenizer for all state-of-the-art
vision Transformer (ViT) models. Even though many ViT variants have been proposed to …

Inference and Reasoning for Semi-Structured Tables

V Gupta - 2023 - search.proquest.com
Semi-structured tabular data, such as ones in e-commerce product descriptions, annual
financial reports, sports score statistics, scientific articles, etc., are ubiquitous in real-world …

Receipt-AVQA-2023 Challenge

A Begaev, E Orlov - Proceedings of the International Conference “ …, 2023 - dialog-21.ru
In this work, we introduce a new challenging Document VQA dataset, named Receipt AVQA,
and present the results of the associated RECEIPT-AVQA-2023 shared task. Receipt AVQA …