FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

CLIPAG: Towards generator-free text-to-image generation

R Ganz, M Elad - Proceedings of the IEEE/CVF Winter …, 2024 - openaccess.thecvf.com
Abstract Perceptually Aligned Gradients (PAG) refer to an intriguing property observed in
robust image classification models, wherein their input gradients align with human …

PreSTU: Pre-training for scene-text understanding

J Kil, S Changpinyo, X Chen, H Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
The ability to recognize and reason about text embedded in visual inputs is often lacking in
vision-and-language (V&L) models, perhaps because V&L pre-training methods have often …

Enhancing vision-language pre-training with rich supervisions

Y Gao, K Shi, P Zhu, E Belval… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-
training paradigm for Vision-Language Models using data from large-scale web screenshot …

Question aware vision transformer for multimodal reasoning

R Ganz, Y Kittenplon, A Aberdam… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language (VL) models have gained significant research focus, enabling remarkable
advances in multimodal reasoning. These architectures typically comprise a vision encoder …

GRAM: Global reasoning for multi-page VQA

T Blau, S Fogel, R Ronen, A Golts… - Proceedings of the …, 2024 - openaccess.thecvf.com
The increasing use of transformer-based large language models brings forward the
challenge of processing long sequences. In document visual question answering (DocVQA) …

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

O Abramovich, N Nayman, S Fogel, I Lavi… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, notable advancements have been made in the domain of visual document
understanding, with the prevailing architecture comprising a cascade of vision and language …

Scene text visual question answering by using YOLO and STN

K Nourali, E Dolkhani - International Journal of Speech Technology, 2024 - Springer
Extracting text from an image using a Visual Question Answering (VQA) system is an
application at the intersection of computer vision and Natural Language Processing (NLP) to …

A Survey on Visual Question Answering Methodologies

AM Al-Zoghby, AS Saleh - The Egyptian Journal of Language …, 2024 - journals.ekb.eg
Understanding visual question-answering (VQA) will be essential for many human tasks.
However, it poses significant obstacles at the core of artificial intelligence as a multimodal …