J Kil, S Changpinyo, X Chen, H Hu, S Goodman… - arXiv preprint arXiv …, 2022 - arxiv.org
The ability to recognize and reason about text embedded in visual inputs is often lacking in
vision-and-language (V&L) models, perhaps because V&L pre-training methods have often …