Tools, techniques, datasets and application areas for object detection in an image: a review

J Kaur, W Singh - Multimedia Tools and Applications, 2022 - Springer
Object detection is one of the most fundamental and challenging tasks to locate objects in
images and videos. Over the past, it has gained much attention to do more research on …

Scene text detection and recognition: The deep learning era

S Long, X He, C Yao - International Journal of Computer Vision, 2021 - Springer
With the rise and development of deep learning, computer vision has been tremendously
transformed and reshaped. As an important research area in computer vision, scene text …

Obelics: An open web-scale filtered dataset of interleaved image-text documents

H Laurençon, L Saulnier, L Tronchon… - Advances in …, 2024 - proceedings.neurips.cc
Large multimodal models trained on natural documents, which interleave images and text,
outperform models trained on image-text pairs on various multimodal benchmarks …

Git: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks
(DNNs) training, and they usually train a DNN for each single visual recognition task …

Scene text recognition with permuted autoregressive sequence models

D Bautista, R Atienza - European conference on computer vision, 2022 - Springer
Context-aware STR methods typically use internal autoregressive (AR) language models
(LM). Inherent limitations of AR models motivated two-stage methods which employ an …

Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition

S Fang, H Xie, Y Wang, Z Mao… - Proceedings of the …, 2021 - openaccess.thecvf.com
Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively
model linguistic rules in end-to-end deep networks remains a research challenge. In this …

SEED-Bench: Benchmarking Multimodal Large Language Models

B Li, Y Ge, Y Ge, G Wang, R Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal large language models (MLLMs) building upon the foundation of powerful large
language models (LLMs) have recently demonstrated exceptional capabilities in generating …

Svtr: Scene text recognition with a single visual model

Y Du, Z Chen, C Jia, X Yin, T Zheng, C Li, Y Du… - arXiv preprint arXiv …, 2022 - arxiv.org
Dominant scene text recognition models commonly contain two building blocks, a visual
model for feature extraction and a sequence model for text transcription. This hybrid …

From two to one: A new scene text recognizer with visual language modeling network

Y Wang, H Xie, S Fang, J Wang… - Proceedings of the …, 2021 - openaccess.thecvf.com
In this paper, we abandon the dominant complex language model and rethink the linguistic
learning process in the scene text recognition. Different from previous methods considering …