Authors
Yusuf Aytar, Carl Vondrick, Antonio Torralba
Publication date
2017/6/3
Journal
arXiv preprint arXiv:1706.00932
Description
We capitalize on large amounts of readily-available, synchronous data to learn deep discriminative representations shared across three major natural modalities: vision, sound, and language. By leveraging over a year of sound from video and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Our experiments suggest that this representation is useful for several tasks, such as cross-modal retrieval or transferring classifiers between modalities. Moreover, although our network is only trained with image+text and image+sound pairs, it can transfer between text and sound as well, a transfer the network never observed during training. Visualizations of our representation reveal many hidden units which automatically emerge to detect concepts, independent of the modality.
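The abstract describes jointly training modality-specific networks into a single shared embedding space using only image+text and image+sound pairs, with text–sound transfer emerging transitively. Below is a minimal, hypothetical PyTorch sketch of that setup, assuming a margin ranking loss with in-batch negatives; the encoder architectures, feature dimensions, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): align image, text, and sound encoders
# in one shared embedding space using only image+text and image+sound pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # assumed shared embedding size


class ModalityEncoder(nn.Module):
    """Placeholder encoder; the paper uses modality-specific ConvNets."""

    def __init__(self, input_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.ReLU(),
            nn.Linear(1024, EMBED_DIM),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm shared embedding


def ranking_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 margin: float = 0.2) -> torch.Tensor:
    """Pull paired embeddings together, push in-batch mismatches apart."""
    sim = anchor @ positive.t()                  # cosine similarity matrix
    pos = sim.diag().unsqueeze(1)                # similarity of matched pairs
    hinge = F.relu(margin + sim - pos)           # hinge on every negative
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)  # drop matched pair
    return (hinge * mask).mean()


# Illustrative training step: the network only ever sees image+text and
# image+sound pairs; text<->sound alignment is expected to emerge because
# both modalities are aligned to the same image embedding.
image_enc = ModalityEncoder(input_dim=2048)   # e.g. pooled ConvNet features
text_enc = ModalityEncoder(input_dim=300)     # e.g. averaged word embeddings
sound_enc = ModalityEncoder(input_dim=128)    # e.g. spectrogram features
params = (list(image_enc.parameters()) + list(text_enc.parameters())
          + list(sound_enc.parameters()))
optim = torch.optim.Adam(params, lr=1e-4)

img_a, txt = torch.randn(32, 2048), torch.randn(32, 300)   # image+text batch
img_b, snd = torch.randn(32, 2048), torch.randn(32, 128)   # image+sound batch

loss = (ranking_loss(image_enc(img_a), text_enc(txt))
        + ranking_loss(image_enc(img_b), sound_enc(snd)))
optim.zero_grad()
loss.backward()
optim.step()
```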
Total citations
[Citations-per-year histogram, 2018–2024]
Scholar articles
Y Aytar, C Vondrick, A Torralba - arXiv preprint arXiv:1706.00932, 2017