Deep captioning with multimodal recurrent neural networks (m-rnn)

Authors

Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille

Publication date

2014/12/20

Journal

arXiv preprint arXiv:1412.6632

Description

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given the previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN model to retrieval tasks for retrieving images or sentences, and achieve significant performance improvements over the state-of-the-art methods that directly optimize the ranking objective function for retrieval. The project page of this work is: www.stat.ucla.edu/~junhua.mao/m-RNN.html
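The structure described in the abstract (a recurrent sub-network for words, an image feature vector from a convolutional network, a multimodal layer that fuses them, and sampling from the resulting word distribution) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the class name `TinyMRNN`, all layer sizes, the random untrained weights, the ReLU and tanh activations, and the `<start>`/`<end>` tokens are assumptions made only for illustration, and the image feature is a stand-in vector rather than real CNN output.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class TinyMRNN:
    """Hypothetical toy model: all sizes and activations are assumptions."""

    def __init__(self, vocab, embed_dim=8, hidden_dim=8, image_dim=6,
                 multi_dim=10, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.hidden_dim = hidden_dim
        n_words = len(vocab)
        self.E = rand_matrix(n_words, embed_dim, rng)        # word embeddings
        self.W_xh = rand_matrix(hidden_dim, embed_dim, rng)  # word -> recurrent
        self.W_hh = rand_matrix(hidden_dim, hidden_dim, rng) # recurrent -> recurrent
        self.V_w = rand_matrix(multi_dim, embed_dim, rng)    # word -> multimodal
        self.V_r = rand_matrix(multi_dim, hidden_dim, rng)   # recurrent -> multimodal
        self.V_i = rand_matrix(multi_dim, image_dim, rng)    # image -> multimodal
        self.W_out = rand_matrix(n_words, multi_dim, rng)    # multimodal -> vocabulary

    def step(self, word_id, h_prev, image_feat):
        """Return P(next word | previous words, image) and the new recurrent state."""
        w = self.E[word_id]
        # Recurrent sub-network: summarizes the words seen so far.
        h = [max(0.0, a + b)
             for a, b in zip(matvec(self.W_xh, w), matvec(self.W_hh, h_prev))]
        # Multimodal layer: the language and image sub-networks meet here.
        m = [math.tanh(a + b + c)
             for a, b, c in zip(matvec(self.V_w, w), matvec(self.V_r, h),
                                matvec(self.V_i, image_feat))]
        return softmax(matvec(self.W_out, m)), h

    def sample_caption(self, image_feat, rng, max_len=10):
        """Generate a caption by sampling one word at a time."""
        start, end = self.vocab.index("<start>"), self.vocab.index("<end>")
        h = [0.0] * self.hidden_dim
        word_id, caption = start, []
        for _ in range(max_len):
            probs, h = self.step(word_id, h, image_feat)
            word_id = rng.choices(range(len(self.vocab)), weights=probs, k=1)[0]
            if word_id == end:
                break
            caption.append(self.vocab[word_id])
        return caption


model = TinyMRNN(["<start>", "<end>", "a", "dog", "runs"])
fake_image_feat = [0.5] * 6  # stand-in for a CNN feature vector
print(model.sample_caption(fake_image_feat, random.Random(1)))
```

Because the weights are random and untrained, the sampled words are meaningless; the point is only the data flow, in which the image feature enters at the multimodal layer and each sampled word is fed back as the next input.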

Total citations

Cited by 1540

Citations per year:
2015: 54
2016: 126
2017: 181
2018: 204
2019: 233
2020: 177
2021: 206
2022: 153
2023: 116
2024: 56
