LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

On the hidden mystery of OCR in large multimodal models

Y Liu, Z Li, B Yang, C Li, X Yin, C Liu, L Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
Large models have recently played a dominant role in natural language processing and
multimodal vision-language learning. However, their effectiveness in text-related visual …

Mathematics intelligent tutoring systems with handwritten input: A scoping review

L Rodrigues, FD Pereira, M Marinho, V Macario… - Education and …, 2024 - Springer
Intelligent Tutoring Systems (ITS) have been widely used to enhance math learning,
wherein teachers' involvement is prominent in achieving their full potential. Usually, ITSs …

When counting meets HMER: counting-aware network for handwritten mathematical expression recognition

B Li, Y Yuan, D Liang, X Liu, Z Ji, J Bai, W Liu… - European conference on …, 2022 - Springer
Recently, most handwritten mathematical expression recognition (HMER) methods adopt
encoder-decoder networks, which directly predict markup sequences from formula …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have made significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

NVLM: Open frontier-class multimodal LLMs

W Dai, N Lee, B Wang, Z Yang, Z Liu, J Barker… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs)
that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary …

Exploring OCR capabilities of GPT-4V(ision): A quantitative and in-depth evaluation

Y Shi, D Peng, W Liao, Z Lin, X Chen, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents a comprehensive evaluation of the Optical Character Recognition
(OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model …

A survey on handwritten mathematical expression recognition: The rise of encoder-decoder and GNN models

TN Truong, CT Nguyen, R Zanibbi, H Mouchère… - Pattern Recognition, 2024 - Elsevier
Recognition of handwritten mathematical expressions (HMEs) has attracted growing interest
due to steady progress in handwriting recognition techniques and the rapid emergence of …

TextHawk2: A large vision-language model excels in bilingual OCR and grounding with 16x fewer tokens

YQ Yu, M Liao, J Zhang, J Wu - arXiv preprint arXiv:2410.05261, 2024 - arxiv.org
Reading dense text and locating objects within images are fundamental abilities for Large
Vision-Language Models (LVLMs) tasked with advanced tasks. Previous LVLMs, including …