FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

CLIPAG: Towards generator-free text-to-image generation

R Ganz, M Elad - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com
Abstract Perceptually Aligned Gradients (PAG) refer to an intriguing property observed in
robust image classification models, wherein their input gradients align with human …

PreSTU: Pre-training for scene-text understanding

J Kil, S Changpinyo, X Chen, H Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
The ability to recognize and reason about text embedded in visual inputs is often lacking in
vision-and-language (V&L) models, perhaps because V&L pre-training methods have often …

Enhancing vision-language pre-training with rich supervisions

Y Gao, K Shi, P Zhu, E Belval… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-
training paradigm for Vision-Language Models using data from large-scale web screenshot …

Question aware vision transformer for multimodal reasoning

R Ganz, Y Kittenplon, A Aberdam… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language (VL) models have gained significant research focus, enabling remarkable
advances in multimodal reasoning. These architectures typically comprise a vision encoder …

GRAM: Global reasoning for multi-page VQA

T Blau, S Fogel, R Ronen, A Golts… - Proceedings of the …, 2024 - openaccess.thecvf.com
The increasing use of transformer-based large language models brings forward the
challenge of processing long sequences. In document visual question answering (DocVQA) …

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

O Abramovich, N Nayman, S Fogel, I Lavi… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, notable advancements have been made in the domain of visual document
understanding, with the prevailing architecture comprising a cascade of vision and language …

Scene text visual question answering by using YOLO and STN

K Nourali, E Dolkhani - International Journal of Speech Technology, 2024 - Springer
Extracting text from an image using a Visual Question Answering (VQA) system is an
application at the intersection of computer vision and Natural Language Processing (NLP) to …

A Survey on Visual Question Answering Methodologies

AM Al-Zoghby, AS Saleh - The Egyptian Journal of Language …, 2024 - journals.ekb.eg
Understanding visual question-answering (VQA) will be essential for many human tasks.
However, it poses significant obstacles at the core of artificial intelligence as a multimodal …