FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of image captioning models. However, these models frequently produce …

Symmetrical linguistic feature distillation with CLIP for scene text recognition

Z Wang, H Xie, Y Wang, J Xu, B Zhang… - Proceedings of the 31st …, 2023 - dl.acm.org
In this paper, we explore the potential of the Contrastive Language-Image Pretraining (CLIP)
model in scene text recognition (STR), and establish a novel Symmetrical Linguistic Feature …

CLIP4STR: A simple baseline for scene text recognition with pre-trained vision-language model

S Zhao, R Quan, L Zhu, Y Yang - arXiv preprint arXiv:2305.14014, 2023 - arxiv.org
Pre-trained vision-language models (VLMs) are the de facto foundation models for various
downstream tasks. However, scene text recognition methods still prefer backbones pre …

Question aware vision transformer for multimodal reasoning

R Ganz, Y Kittenplon, A Aberdam… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language (VL) models have gained significant research focus, enabling remarkable
advances in multimodal reasoning. These architectures typically comprise a vision encoder …

GRAM: Global reasoning for multi-page VQA

T Blau, S Fogel, R Ronen, A Golts… - Proceedings of the …, 2024 - openaccess.thecvf.com
The increasing use of transformer-based large language models brings forward the
challenge of processing long sequences. In document visual question answering (DocVQA) …

Harnessing the Power of Multi-Lingual Datasets for Pre-training: Towards Enhancing Text Spotting Performance

A Das, S Biswas, A Banerjee, J Lladós… - Proceedings of the …, 2024 - openaccess.thecvf.com
The ability to adapt to a wide range of domains is crucial for scene text spotting
models deployed in real-world conditions. However, existing state-of-the-art …

A Region-Prompted Adapter Tuning for Visual Abductive Reasoning

H Zhang, YK Ee, B Fernando - arXiv preprint arXiv:2303.10428, 2023 - arxiv.org
Visual Abductive Reasoning is an emerging vision-language (VL) topic in which a model
must retrieve or generate a likely textual hypothesis from a visual input (an image or part of it) …

Open-Set Text Recognition Implementations (II): Sample-to-Representation Mapping

XC Yin, C Yang, C Liu - Open-Set Text Recognition: Concepts, Framework …, 2024 - Springer
This chapter introduces how representations are encoded and extracted from samples, i.e.,
the sample-to-representation mapping module in the framework discussed above (Fig.). The …

Open-Set Text Recognition: Concept, Dataset, Protocol, and Framework

XC Yin, C Yang, C Liu - Open-Set Text Recognition: Concepts, Framework …, 2024 - Springer
This chapter gives a clear and detailed definition of the OSTR task. First, we describe its
aim, goal, and scope, then formulate and define the OSTR task and the relation between …