Clearly explaining a rationale for a classification decision to an end user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque …
In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object. Natural …
Referring expressions usually describe an object using properties of the object and relationships of the object with other objects. We propose a technique that integrates context …
Grounding (ie localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text …
In this paper we approach the novel problem of segmenting an image based on a natural language expression. This is different from traditional semantic segmentation over a …
Convolutional neural networks (CNN) pre-trained on ImageNet are the backbone of most state-of-the-art approaches. In this paper, we present a new set of pre-trained models with …
J Andreas, D Klein - arXiv preprint arXiv:1604.00562, 2016 - arxiv.org
We present a model for pragmatically describing scenes, in which contrastive behavior results from a combination of inference-driven pragmatics and learned semantics. Like …
A great video title describes the most salient event compactly and captures the viewer's attention. In contrast, video captioning tends to generate sentences that describe the video …
S Zarrieß, D Schlangen - Proceedings of the 54th Annual Meeting …, 2016 - aclanthology.org
Research on generating referring expressions has so far mostly focussed on “oneshot reference”, where the aim is to generate a single, discriminating expression. In interactive …