GIT: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

TrOCR: Transformer-based optical character recognition with pre-trained models

M Li, T Lv, J Chen, L Cui, Y Lu, D Florencio… - Proceedings of the …, 2023 - ojs.aaai.org
Text recognition is a long-standing research problem for document digitalization. Existing
approaches are usually built based on CNN for image understanding and RNN for char …

Revisiting scene text recognition: A data perspective

Q Jiang, J Wang, D Peng, C Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
This paper aims to re-assess scene text recognition (STR) from a data-oriented perspective.
We begin by revisiting the six commonly used benchmarks in STR and observe a trend of …

Masked modeling for self-supervised representation learning on vision and beyond

S Li, L Zhang, Z Wang, D Wu, L Wu, Z Liu, J Xia… - arXiv preprint arXiv …, 2023 - arxiv.org
As the deep learning revolution marches on, self-supervised learning has garnered
increasing attention in recent years thanks to its remarkable representation learning ability …

Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing

B Zhang, H Xie, Z Gao, Y Wang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Scene text images contain not only style information (font, background) but also content
information (character, texture). Different scene text tasks need different information but …

DTrOCR: Decoder-only transformer for optical character recognition

M Fujitake - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com
Typical text recognition methods rely on an encoder-decoder structure, in which the encoder
extracts features from an image, and the decoder produces recognized text from these …

Symmetrical linguistic feature distillation with CLIP for scene text recognition

Z Wang, H Xie, Y Wang, J Xu, B Zhang… - Proceedings of the 31st …, 2023 - dl.acm.org
In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP)
model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature …

CLIPTER: Looking at the bigger picture in scene text recognition

A Aberdam, D Bensaïd, A Golts… - Proceedings of the …, 2023 - openaccess.thecvf.com
Reading text in real-world scenarios often requires understanding the context surrounding it,
especially when dealing with poor-quality text. However, current scene text recognizers are …

HierCode: A lightweight hierarchical codebook for zero-shot Chinese text recognition

Y Zhang, Y Zhu, D Peng, P Zhang, Z Yang, Z Yang… - Pattern Recognition, 2025 - Elsevier
Text recognition, especially for complex scripts like Chinese, faces unique challenges due to
its intricate character structures and vast vocabulary. Traditional one-hot encoding methods …

Hierarchical visual-semantic interaction for scene text recognition

L Diao, X Tang, J Wang, G Xie, J Hu - Information Fusion, 2024 - Elsevier
Proper interaction between visual and semantic features is crucial to obtain a powerful
feature representation for scene text recognition (STR). The existing interaction methods …