Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

W Shi, Z Hu, Y Bin, J Liu, Y Yang, SK Ng, L Bing… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated impressive reasoning capabilities,
particularly in textual mathematical problem-solving. However, existing open-source image …

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

X He, L Wei, L Xie, Q Tian - arXiv preprint arXiv:2401.03105, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a
plethora of noteworthy contributions in recent months. The prevailing trend involves …

The state of the art in creating visualization corpora for automated chart analysis

C Chen, Z Liu - Computer Graphics Forum, 2023 - Wiley Online Library
We present a state-of-the-art report on visualization corpora in automated chart analysis
research. We survey 56 papers that created or used a visualization corpus as the input of …

FinTral: A family of GPT-4 level multimodal financial large language models

G Bhatia, EMB Nagoudi, H Cavusoglu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce FinTral, a suite of state-of-the-art multimodal large language models (LLMs)
built upon the Mistral-7b model and tailored for financial analysis. FinTral integrates textual …

ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning

R Xia, B Zhang, H Ye, X Yan, Q Liu, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, many versatile Multi-modal Large Language Models (MLLMs) have emerged.
However, their capacity to query information depicted in visual charts and …

FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback

A Singh, P Agarwal, Z Huang, A Singh, T Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Captions are crucial for understanding scientific visualizations and documents. Existing
captioning methods for scientific figures rely on figure-caption pairs extracted from …

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

J Tang, Q Liu, Y Ye, J Lu, S Wei, C Lin, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates
human-machine interaction in text-centric visual environments but also serves as a de facto …

OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

J Chen, L Kong, H Wei, C Liu, Z Ge, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Chart parsing poses a significant challenge due to the diversity of styles, values, texts, and
so forth. Even advanced large vision-language models (LVLMs) with billions of parameters …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Y Deng, P Lu, F Yin, Z Hu, S Shen, J Zou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision language models (LVLMs) integrate large language models (LLMs) with
pre-trained vision encoders, thereby activating the perception capability of the model to …