Related articles - Academic resource search

A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching

P Das, C Xu, RF Doell, JJ Corso - Proceedings of the IEEE …, 2013 - openaccess.thecvf.com

The problem of describing images through natural language has gained importance in the
computer vision community. Solutions to image description have either focused on a top …

Cited by 396 - Related articles - All 15 versions

[PDF] thecvf.com

Translating video content to natural language descriptions

M Rohrbach, W Qiu, I Titov, S Thater… - Proceedings of the …, 2013 - openaccess.thecvf.com

Humans use rich natural language to describe and communicate visual perceptions. In
order to provide natural language descriptions for visual content, this paper combines two …

Cited by 465 - Related articles - All 15 versions

[PDF] thecvf.com

MSR-VTT: A large video description dataset for bridging video and language

J Xu, T Mei, T Yao, Y Rui - Proceedings of the IEEE …, 2016 - openaccess.thecvf.com

While there has been increasing interest in the task of describing video with natural
language, current computer vision algorithms are still severely limited in terms of the …

Cited by 1925 - Related articles - All 10 versions

[PDF] thecvf.com

Dense captioning with joint inference and visual context

L Yang, K Tang, J Yang, LJ Li - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com

Dense captioning is a newly emerging computer vision topic for understanding images with
dense language descriptions. The goal is to densely detect visual concepts (e.g., objects …

Cited by 202 - Related articles - All 8 versions

[PDF] thecvf.com

From captions to visual concepts and back

H Fang, S Gupta, F Iandola… - Proceedings of the …, 2015 - openaccess.thecvf.com

This paper presents a novel approach for automatically generating image descriptions:
visual detectors, language models, and multimodal similarity models learnt directly from a …

Cited by 1620 - Related articles - All 24 versions

[PDF] thecvf.com

Image generation from scene graphs

J Johnson, A Gupta, L Fei-Fei - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com

To truly understand the visual world our models should be able not only to recognize images
but also generate them. To this end, there has been exciting recent progress on generating …

Cited by 886 - Related articles - All 10 versions

[PDF] thecvf.com

Grounded video description

L Zhou, Y Kalantidis, X Chen… - Proceedings of the …, 2019 - openaccess.thecvf.com

Video description is one of the most challenging problems in vision and language
understanding due to the large variability both on the video and language side. Models …

Cited by 211 - Related articles - All 8 versions

[PDF] thecvf.com

Describing videos by exploiting temporal structure

L Yao, A Torabi, K Cho, N Ballas… - Proceedings of the …, 2015 - openaccess.thecvf.com

Recent progress in using recurrent neural networks (RNNs) for image description has
motivated the exploration of their application for video description. However, while images …

Cited by 1319 - Related articles - All 14 versions

[PDF] thecvf.com

Jointly localizing and describing events for dense video captioning

Y Li, T Yao, Y Pan, H Chao… - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com

Automatically describing a video with natural language is regarded as a fundamental
challenge in computer vision. The problem nevertheless is not trivial especially when a …

Cited by 204 - Related articles - All 7 versions

[PDF] cv-foundation.org

Deep visual-semantic alignments for generating image descriptions

A Karpathy, L Fei-Fei - Proceedings of the IEEE conference on …, 2015 - cv-foundation.org

We present a model that generates natural language descriptions of images and their
regions. Our approach leverages datasets of images and their sentence descriptions to …

Cited by 6901 - Related articles - All 39 versions
