- Academic resource search

GIT: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

Cited by 559 · Related articles · All 4 versions

[PDF] aaai.org

TrOCR: Transformer-based optical character recognition with pre-trained models

M Li, T Lv, J Chen, L Cui, Y Lu, D Florencio… - Proceedings of the …, 2023 - ojs.aaai.org

Text recognition is a long-standing research problem for document digitalization. Existing
approaches are usually built based on CNN for image understanding and RNN for char …

Cited by 426 · Related articles · All 4 versions

[PDF] thecvf.com

Revisiting scene text recognition: A data perspective

Q Jiang, J Wang, D Peng, C Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com

This paper aims to re-assess scene text recognition (STR) from a data-oriented perspective.
We begin by revisiting the six commonly used benchmarks in STR and observe a trend of …

Cited by 42 · Related articles · All 5 versions

[PDF] arxiv.org

Masked modeling for self-supervised representation learning on vision and beyond

S Li, L Zhang, Z Wang, D Wu, L Wu, Z Liu, J Xia… - arXiv preprint arXiv …, 2023 - arxiv.org

As the deep learning revolution marches on, self-supervised learning has garnered
increasing attention in recent years thanks to its remarkable representation learning ability …

Cited by 9 · Related articles · All 2 versions

[PDF] thecvf.com

Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing

B Zhang, H Xie, Z Gao, Y Wang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com

Scene text images contain not only style information (font, background) but also content
information (character, texture). Different scene text tasks need different information but …

Cited by 9 · Related articles · All 3 versions

[PDF] thecvf.com

DTrOCR: Decoder-only transformer for optical character recognition

M Fujitake - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com

Typical text recognition methods rely on an encoder-decoder structure, in which the encoder
extracts features from an image, and the decoder produces recognized text from these …

Cited by 40 · Related articles · All 7 versions

[PDF] arxiv.org

Symmetrical linguistic feature distillation with CLIP for scene text recognition

Z Wang, H Xie, Y Wang, J Xu, B Zhang… - Proceedings of the 31st …, 2023 - dl.acm.org

In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP)
model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature …

Cited by 22 · Related articles · All 4 versions

[PDF] thecvf.com

CLIPTER: Looking at the bigger picture in scene text recognition

A Aberdam, D Bensaïd, A Golts… - Proceedings of the …, 2023 - openaccess.thecvf.com

Reading text in real-world scenarios often requires understanding the context surrounding it,
especially when dealing with poor-quality text. However, current scene text recognizers are …

Cited by 20 · Related articles · All 8 versions

[PDF] arxiv.org

HierCode: A lightweight hierarchical codebook for zero-shot Chinese text recognition

Y Zhang, Y Zhu, D Peng, P Zhang, Z Yang, Z Yang… - Pattern Recognition, 2025 - Elsevier

Text recognition, especially for complex scripts like Chinese, faces unique challenges due to
its intricate character structures and vast vocabulary. Traditional one-hot encoding methods …

Cited by 3 · Related articles · All 2 versions

Hierarchical visual-semantic interaction for scene text recognition

L Diao, X Tang, J Wang, G Xie, J Hu - Information Fusion, 2024 - Elsevier

Proper interaction between visual and semantic features is crucial to obtain a powerful
feature representation for scene text recognition (STR). The existing interaction methods …

Cited by 4 · Related articles · All 2 versions
