Generation and comprehension of unambiguous object descriptions

Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org

Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Cited by 3185 · Related articles · All 12 versions

[PDF] jair.org

Survey of the state of the art in natural language generation: Core tasks, applications and evaluation

A Gatt, E Krahmer - Journal of Artificial Intelligence Research, 2018 - jair.org

This paper surveys the current state of the art in Natural Language Generation (NLG),
defined as the task of generating text or speech from non-linguistic input. A survey of NLG is …

Cited by 1040 · Related articles · All 15 versions

[PDF] thecvf.com

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

P Anderson, Q Wu, D Teney, J Bruce… - Proceedings of the …, 2018 - openaccess.thecvf.com

A robot that can carry out a natural-language instruction has been a dream since before the
Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot …

Cited by 1276 · Related articles · All 16 versions

[PDF] arxiv.org

A corpus for reasoning about natural language grounded in photographs

A Suhr, S Zhou, A Zhang, I Zhang, H Bai… - arXiv preprint arXiv …, 2018 - arxiv.org

We introduce a new dataset for joint reasoning about natural language and images, with a
focus on semantic diversity, compositionality, and visual reasoning challenges. The data …

Cited by 521 · Related articles · All 8 versions

[PDF] thecvf.com

MAttNet: Modular attention network for referring expression comprehension

L Yu, Z Lin, X Shen, J Yang, X Lu… - Proceedings of the …, 2018 - openaccess.thecvf.com

In this paper, we address referring expression comprehension: localizing an image region
described by a natural language expression. While most recent work treats expressions as a …

Cited by 818 · Related articles · All 9 versions

[PDF] thecvf.com

Neural baby talk

J Lu, J Yang, D Batra, D Parikh - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com

We introduce a novel framework for image captioning that can produce natural language
explicitly grounded in entities that object detectors find in the image. Our approach …

Cited by 547 · Related articles · All 9 versions

[PDF] neurips.cc

Speaker-follower models for vision-and-language navigation

D Fried, R Hu, V Cirik, A Rohrbach… - Advances in neural …, 2018 - proceedings.neurips.cc

Navigation guided by natural language instructions presents a challenging reasoning
problem for instruction followers. Natural language instructions typically identify only a few …

Cited by 486 · Related articles · All 8 versions

[PDF] thecvf.com

A joint sequence fusion model for video question answering and retrieval

Y Yu, J Kim, G Kim - Proceedings of the European …, 2018 - openaccess.thecvf.com

We present an approach named JSFusion (Joint Sequence Fusion) that can measure
semantic similarity between any pairs of multimodal sequence data (e.g., a video sequence …

Cited by 367 · Related articles · All 10 versions

[PDF] aclanthology.org

Temporally grounding natural sentence in video

J Chen, X Chen, L Ma, Z Jie… - Proceedings of the 2018 …, 2018 - aclanthology.org

We introduce an effective and efficient method that grounds (i.e., localizes) natural sentences
in long, untrimmed video sequences. Specifically, a novel Temporal GroundNet (TGN) is …

Cited by 322 · Related articles · All 6 versions

[PDF] researchgate.net

Cross-modal moment localization in videos

M Liu, X Wang, L Nie, Q Tian, B Chen… - Proceedings of the 26th …, 2018 - dl.acm.org

In this paper, we address the temporal moment localization issue, namely, localizing a video
moment described by a natural language query in an untrimmed video. This is a general yet …

Cited by 217 · Related articles · All 4 versions
