Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is …
Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal …
L Huang, W Wang, J Chen… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
Attention mechanisms are widely used in current encoder/decoder frameworks of image captioning, where a weighted average on encoded vectors is generated at each time step to …
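The weighted average over encoded vectors described in this snippet can be sketched as follows. This is a minimal NumPy illustration, not any cited paper's actual model: the bilinear scoring function and all names (`attend`, `encoded`, `query`, `W`) are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over attention scores
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(encoded, query, W):
    # one score per encoded region vector (illustrative bilinear form)
    scores = encoded @ W @ query
    weights = softmax(scores)
    # context vector: weighted average of the encoded vectors,
    # recomputed at each decoding time step from the current query
    return weights @ encoded, weights

rng = np.random.default_rng(0)
encoded = rng.standard_normal((5, 8))  # e.g. 5 image regions, dim 8
query = rng.standard_normal(8)         # e.g. decoder hidden state
W = rng.standard_normal((8, 8))
context, weights = attend(encoded, query, W)
```

At each time step the decoder would supply a new `query`, yielding a new set of weights and hence a new context vector over the same encoded image regions.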
J Chen, H Guo, K Yi, B Li… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The limited availability of annotated data often hinders real-world applications of machine learning. To efficiently learn from small quantities of multimodal data, we leverage the …
A Gatt, E Krahmer - Journal of Artificial Intelligence Research, 2018 - jair.org
This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is …
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper …
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from …
A Karpathy, L Fei-Fei - Proceedings of the IEEE conference on …, 2015 - cv-foundation.org
We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to …