Multimodal semi-supervised learning for text recognition

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com

The advent of vision-language pre-training techniques enhanced substantial progress in the
development of models for image captioning. However, these models frequently produce …

Cited by 8 · Related articles · All 4 versions

[PDF] arxiv.org

Out-of-vocabulary challenge report

S Garcia-Bordils, A Mafla, AF Biten, O Nuriel… - … on Computer Vision, 2022 - Springer

This paper presents final results of the Out-Of-Vocabulary 2022 (OOV) challenge. The OOV
contest introduces an important aspect that is not commonly studied by Optical Character …

Cited by 17 · Related articles · All 8 versions

[PDF] thecvf.com

CLIPTER: Looking at the bigger picture in scene text recognition

A Aberdam, D Bensaïd, A Golts… - Proceedings of the …, 2023 - openaccess.thecvf.com

Reading text in real-world scenarios often requires understanding the context surrounding it,
especially when dealing with poor-quality text. However, current scene text recognizers are …

Cited by 12 · Related articles · All 8 versions

[PDF] arxiv.org

FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

N Rotstein, D Bensaid, S Brody, R Ganz… - arXiv preprint arXiv …, 2023 - arxiv.org

The advent of vision-language pre-training techniques enhanced substantial progress in the
development of models for image captioning. However, these models frequently produce …

Cited by 14 · Related articles · All 2 versions

[PDF] thecvf.com

CLIPAG: Towards generator-free text-to-image generation

R Ganz, M Elad - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com

Perceptually Aligned Gradients (PAG) refer to an intriguing property observed in
robust image classification models, wherein their input gradients align with human …

Cited by 5 · Related articles · All 5 versions

[PDF] thecvf.com

Towards models that can see and read

R Ganz, O Nuriel, A Aberdam… - Proceedings of the …, 2023 - openaccess.thecvf.com

Visual Question Answering (VQA) and Image Captioning (CAP), which are among
the most popular vision-language tasks, have analogous scene-text versions that require …

Cited by 10 · Related articles · All 7 versions

[PDF] thecvf.com

Question aware vision transformer for multimodal reasoning

R Ganz, Y Kittenplon, A Aberdam… - Proceedings of the …, 2024 - openaccess.thecvf.com

Vision-Language (VL) models have gained significant research focus enabling remarkable
advances in multimodal reasoning. These architectures typically comprise a vision encoder …

Cited by 3 · Related articles · All 5 versions

Fine-grained Pseudo Labels for Scene Text Recognition

X Li, X Chen, Z Huang, L Xie, J Chen… - Proceedings of the 31st …, 2023 - dl.acm.org

Pseudo-Labeling based semi-supervised learning has shown promising advantages in
Scene Text Recognition (STR). Most of them usually use a pre-trained model to generate …

Cited by 1 · Related articles

[BOOK][B] Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XX

S Avidan, G Brostow, M Cissé, GM Farinella, T Hassner - 2022 - books.google.com

The 39-volume set, comprising the LNCS books 13661 until 13699, constitutes the refereed
proceedings of the 17th European Conference on Computer Vision, ECCV 2022, held in Tel …

Cited by 6 · Related articles · All 6 versions

[PDF] thecvf.com

GRAM: Global reasoning for multi-page VQA

T Blau, S Fogel, R Ronen, A Golts… - Proceedings of the …, 2024 - openaccess.thecvf.com

The increasing use of transformer-based large language models brings forward the
challenge of processing long sequences. In document visual question answering (DocVQA) …

Cited by 1 · Related articles · All 5 versions
