J Pont-Tuset, J Uijlings, S Changpinyo… - arXiv e …, 2019 - ui.adsabs.harvard.edu
Abstract: We propose Localized Narratives, a new form of multimodal image annotations
connecting vision and language. We ask annotators to describe an image with their voice …