Aligning where to see and what to tell: image caption with region-based attention and scene...

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org

Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

Cited by 998 · Related articles · All 8 versions

[PDF] ieee.org

Automatic chart understanding: a review

AM Farahani, P Adibi, MS Ehsani, HP Hutter… - IEEE …, 2023 - ieeexplore.ieee.org

Automated chart analysis has vast potential to improve the accessibility of charts for a wider
audience, e.g., people with visual impairments or other disabilities, by generating captions for …

Cited by 26 · Related articles · All 4 versions

[PDF] thecvf.com

Bottom-up and top-down attention for image captioning and visual question answering

P Anderson, X He, C Buehler… - Proceedings of the …, 2018 - openaccess.thecvf.com

Top-down visual attention mechanisms have been used extensively in image captioning
and visual question answering (VQA) to enable deeper image understanding through fine …

Cited by 5603 · Related articles · All 16 versions

[PDF] thecvf.com

Semantic compositional networks for visual captioning

Z Gan, C Gan, X He, Y Pu, K Tran… - Proceedings of the …, 2017 - openaccess.thecvf.com

A Semantic Compositional Network (SCN) is developed for image captioning, in
which semantic concepts (i.e., tags) are detected from the image, and the probability of each …

Cited by 547 · Related articles · All 12 versions

[PDF] arxiv.org

Video summarization with long short-term memory

K Zhang, WL Chao, F Sha, K Grauman - … 14, 2016, Proceedings, Part VII 14, 2016 - Springer

We propose a novel supervised learning technique for summarizing videos by automatically
selecting keyframes or key subshots. Casting the task as a structured prediction problem …

Cited by 895 · Related articles · All 6 versions

[PDF] arxiv.org

Grounding of textual phrases in images by reconstruction

A Rohrbach, M Rohrbach, R Hu, T Darrell… - Computer Vision–ECCV …, 2016 - Springer

Grounding (i.e., localizing) arbitrary, free-form textual phrases in visual content is a
challenging problem with many applications for human-computer interaction and image-text …

Cited by 562 · Related articles · All 7 versions

[PDF] arxiv.org

Image captioning and visual question answering based on attributes and external knowledge

Q Wu, C Shen, P Wang, A Dick… - IEEE transactions on …, 2017 - ieeexplore.ieee.org

Much of the recent progress in Vision-to-Language problems has been achieved through a
combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks …

Cited by 492 · Related articles · All 8 versions

[PDF] cv-foundation.org

What value do explicit high level concepts have in vision to language problems?

Q Wu, C Shen, L Liu, A Dick… - Proceedings of the …, 2016 - cv-foundation.org

Much recent progress in Vision-to-Language (V2L) problems has been achieved through a
combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks …

Cited by 554 · Related articles · All 10 versions

[PDF] arxiv.org

ABC-CNN: An attention based convolutional neural network for visual question answering

K Chen, J Wang, LC Chen, H Gao, W Xu… - arXiv preprint arXiv …, 2015 - arxiv.org

We propose a novel attention based deep learning architecture for visual question
answering task (VQA). Given an image and an image related natural language question …

Cited by 404 · Related articles · All 5 versions

[PDF] ict.ac.cn

Know more say less: Image captioning based on scene graphs

X Li, S Jiang - IEEE Transactions on Multimedia, 2019 - ieeexplore.ieee.org

Automatically describing the content of an image has been attracting considerable research
attention in the multimedia field. To represent the content of an image, many approaches …

Cited by 197 · Related articles · All 5 versions
