Exploring OCR capabilities of GPT-4V(ision): A quantitative and in-depth evaluation

Y Shi, D Peng, W Liao, Z Lin, X Chen, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents a comprehensive evaluation of the Optical Character Recognition
(OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model …

Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing

B Zhang, H Xie, Z Gao, Y Wang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Scene text images contain not only style information (font, background) but also content
information (character, texture). Different scene text tasks need different information but …

Unified hallucination detection for multimodal large language models

X Chen, C Wang, Y Xue, N Zhang, X Yang, Q Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant strides in multimodal tasks, Multimodal Large Language Models (MLLMs)
are plagued by the critical issue of hallucination. The reliable detection of such …

CLIP4STR: A simple baseline for scene text recognition with pre-trained vision-language model

S Zhao, R Quan, L Zhu, Y Yang - arXiv preprint arXiv:2305.14014, 2023 - arxiv.org
Pre-trained vision-language models (VLMs) are the de-facto foundation models for various
downstream tasks. However, scene text recognition methods still prefer backbones pre …

Bridging the Gap Between End-to-End and Two-Step Text Spotting

M Huang, H Li, Y Liu, X Bai… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Modularity plays a crucial role in the development and maintenance of complex systems.
While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub …

OTE: Exploring Accurate Scene Text Recognition Using One Token

J Xu, Y Wang, H Xie, Y Zhang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In this paper, we propose a novel framework to fully exploit the potential of a single vector for
scene text recognition (STR). Different from previous sequence-to-sequence methods that …

Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer

Z Zhao, J Tang, C Lin, B Wu, C Huang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Scene text recognition (STR) in the wild frequently encounters challenges when coping with
domain variations, font diversity, shape deformations, etc. A straightforward solution is …

Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

Z Gao, Y Wang, Y Qu, B Zhang, Z Wang, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
In text recognition, self-supervised pre-training emerges as a good solution to reduce
dependence on expansive annotated real data. Previous studies primarily focus on local …

Masked and permuted implicit context learning for scene text recognition

X Yang, Z Qiao, J Wei, D Yang… - IEEE Signal Processing …, 2024 - ieeexplore.ieee.org
Scene Text Recognition (STR) is challenging because of various text styles, shapes, and
backgrounds. Although the integration of linguistic information enhances models' …

TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model

J Lyu, J Wei, G Zeng, Z Li, E Xie, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing scene text spotters are designed to locate and transcribe texts from images.
However, it is challenging for a spotter to achieve precise detection and recognition of scene …