Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

W Shi, Z Hu, Y Bin, J Liu, Y Yang, SK Ng, L Bing… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have demonstrated impressive reasoning capabilities,
particularly in textual mathematical problem-solving. However, existing open-source image …

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

X He, L Wei, L Xie, Q Tian - arXiv preprint arXiv:2401.03105, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a
plethora of noteworthy contributions in recent months. The prevailing trend involves …

The state of the art in creating visualization corpora for automated chart analysis

C Chen, Z Liu - Computer Graphics Forum, 2023 - Wiley Online Library
We present a state-of-the-art report on visualization corpora in automated chart analysis
research. We survey 56 papers that created or used a visualization corpus as the input of …

FinTral: A family of GPT-4 level multimodal financial large language models

G Bhatia, EMB Nagoudi, H Cavusoglu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce FinTral, a suite of state-of-the-art multimodal large language models (LLMs)
built upon the Mistral-7b model and tailored for financial analysis. FinTral integrates textual …

ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning

R Xia, B Zhang, H Ye, X Yan, Q Liu, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, many versatile Multi-modal Large Language Models (MLLMs) have emerged.
However, their capacity to query information depicted in visual charts and …

FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback

A Singh, P Agarwal, Z Huang, A Singh, T Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Captions are crucial for understanding scientific visualizations and documents. Existing
captioning methods for scientific figures rely on figure-caption pairs extracted from …

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

J Tang, Q Liu, Y Ye, J Lu, S Wei, C Lin, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates
human-machine interaction in text-centric visual environments but also serves as a de facto …

OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

J Chen, L Kong, H Wei, C Liu, Z Ge, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Chart parsing poses a significant challenge due to the diversity of styles, values, texts, and
so forth. Even advanced large vision-language models (LVLMs) with billions of parameters …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Y Deng, P Lu, F Yin, Z Hu, S Shen, J Zou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision language models (LVLMs) integrate large language models (LLMs) with
pre-trained vision encoders, thereby activating the perception capability of the model to …