Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Survey of the state of the art in natural language generation: Core tasks, applications and evaluation

A Gatt, E Krahmer - Journal of Artificial Intelligence Research, 2018 - jair.org
This paper surveys the current state of the art in Natural Language Generation (NLG),
defined as the task of generating text or speech from non-linguistic input. A survey of NLG is …

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

P Anderson, Q Wu, D Teney, J Bruce… - Proceedings of the …, 2018 - openaccess.thecvf.com
A robot that can carry out a natural-language instruction has been a dream since before the
Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot …

A corpus for reasoning about natural language grounded in photographs

A Suhr, S Zhou, A Zhang, I Zhang, H Bai… - arXiv preprint arXiv …, 2018 - arxiv.org
We introduce a new dataset for joint reasoning about natural language and images, with a
focus on semantic diversity, compositionality, and visual reasoning challenges. The data …

MAttNet: Modular attention network for referring expression comprehension

L Yu, Z Lin, X Shen, J Yang, X Lu… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we address referring expression comprehension: localizing an image region
described by a natural language expression. While most recent work treats expressions as a …

Neural baby talk

J Lu, J Yang, D Batra, D Parikh - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com
We introduce a novel framework for image captioning that can produce natural language
explicitly grounded in entities that object detectors find in the image. Our approach …

Speaker-follower models for vision-and-language navigation

D Fried, R Hu, V Cirik, A Rohrbach… - Advances in neural …, 2018 - proceedings.neurips.cc
Navigation guided by natural language instructions presents a challenging reasoning
problem for instruction followers. Natural language instructions typically identify only a few …

A joint sequence fusion model for video question answering and retrieval

Y Yu, J Kim, G Kim - Proceedings of the European …, 2018 - openaccess.thecvf.com
We present an approach named JSFusion (Joint Sequence Fusion) that can measure
semantic similarity between any pairs of multimodal sequence data (e.g. a video sequence …

Temporally grounding natural sentence in video

J Chen, X Chen, L Ma, Z Jie… - Proceedings of the 2018 …, 2018 - aclanthology.org
We introduce an effective and efficient method that grounds (i.e., localizes) natural sentences
in long, untrimmed video sequences. Specifically, a novel Temporal GroundNet (TGN) is …

Cross-modal moment localization in videos

M Liu, X Wang, L Nie, Q Tian, B Chen… - Proceedings of the 26th …, 2018 - dl.acm.org
In this paper, we address the temporal moment localization issue, namely, localizing a video
moment described by a natural language query in an untrimmed video. This is a general yet …