FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques enhanced substantial progress in the
development of models for image captioning. However, these models frequently produce …

Out-of-vocabulary challenge report

S Garcia-Bordils, A Mafla, AF Biten, O Nuriel… - … on Computer Vision, 2022 - Springer
This paper presents final results of the Out-Of-Vocabulary 2022 (OOV) challenge. The OOV
contest introduces an important aspect that is not commonly studied by Optical Character …

CLIPTER: Looking at the bigger picture in scene text recognition

A Aberdam, D Bensaïd, A Golts… - Proceedings of the …, 2023 - openaccess.thecvf.com
Reading text in real-world scenarios often requires understanding the context surrounding it,
especially when dealing with poor-quality text. However, current scene text recognizers are …

FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

N Rotstein, D Bensaïd, S Brody, R Ganz… - arXiv preprint arXiv …, 2023 - arxiv.org
The advent of vision-language pre-training techniques enhanced substantial progress in the
development of models for image captioning. However, these models frequently produce …

CLIPAG: Towards generator-free text-to-image generation

R Ganz, M Elad - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com
Perceptually Aligned Gradients (PAG) refer to an intriguing property observed in
robust image classification models, wherein their input gradients align with human …

Towards models that can see and read

R Ganz, O Nuriel, A Aberdam… - Proceedings of the …, 2023 - openaccess.thecvf.com
Visual Question Answering (VQA) and Image Captioning (CAP), which are among
the most popular vision-language tasks, have analogous scene-text versions that require …

Question aware vision transformer for multimodal reasoning

R Ganz, Y Kittenplon, A Aberdam… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language (VL) models have gained significant research focus enabling remarkable
advances in multimodal reasoning. These architectures typically comprise a vision encoder …

Fine-grained Pseudo Labels for Scene Text Recognition

X Li, X Chen, Z Huang, L Xie, J Chen… - Proceedings of the 31st …, 2023 - dl.acm.org
Pseudo-Labeling based semi-supervised learning has shown promising advantages in
Scene Text Recognition (STR). Most of them usually use a pre-trained model to generate …

[BOOK][B] Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XX

S Avidan, G Brostow, M Cissé, GM Farinella, T Hassner - 2022 - books.google.com
The 39-volume set, comprising the LNCS books 13661 until 13699, constitutes the refereed
proceedings of the 17th European Conference on Computer Vision, ECCV 2022, held in Tel …

GRAM: Global reasoning for multi-page VQA

T Blau, S Fogel, R Ronen, A Golts… - Proceedings of the …, 2024 - openaccess.thecvf.com
The increasing use of transformer-based large language models brings forward the
challenge of processing long sequences. In document visual question answering (DocVQA) …