A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

Comparing deep learning models for low-light natural scene image enhancement and their impact on object detection and classification: Overview, empirical …

R Al Sobbahi, J Tekli - Signal Processing: Image Communication, 2022 - Elsevier
Low-light image (LLI) enhancement is an important image processing task that aims at
improving the illumination of images taken under low-light conditions. Recently, a …

GRES: Generalized referring expression segmentation

C Liu, H Ding, X Jiang - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
Abstract Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES datasets and …

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Y Zeng, H Zhang, J Zheng, J Xia, G Wei… - Proceedings of the …, 2024 - aclanthology.org
Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in
processing image inputs and following open-ended instructions. Despite these …

Attention guided low-light image enhancement with a large scale low-light simulation dataset

F Lv, Y Li, F Lu - International Journal of Computer Vision, 2021 - Springer
Low-light image enhancement is challenging in that it needs to consider not only brightness
recovery but also complex issues like color distortion and noise, which usually hide in the …

Multimodal compact bilinear pooling for visual question answering and visual grounding

A Fukui, DH Park, D Yang, A Rohrbach… - arXiv preprint arXiv …, 2016 - arxiv.org
Modeling textual or visual information with vector representations trained from large
language or visual datasets has been successfully explored in recent years. However, tasks …

Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

BA Plummer, L Wang, CM Cervantes… - Proceedings of the …, 2015 - openaccess.thecvf.com
The Flickr30k dataset has become a standard benchmark for sentence-based image
description. This paper presents Flickr30k Entities, which augments the 158k captions from …

Multi30K: Multilingual English-German image descriptions

D Elliott, S Frank, K Sima'an, L Specia - arXiv preprint arXiv:1605.00459, 2016 - arxiv.org
We introduce the Multi30K dataset to stimulate multilingual multimodal research. Recent
advances in image description have been demonstrated on English-language datasets …

Deep captioning with multimodal recurrent neural networks (m-RNN)

J Mao, W Xu, Y Yang, J Wang, Z Huang… - arXiv preprint arXiv …, 2014 - arxiv.org
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for
generating novel image captions. It directly models the probability distribution of generating …

ReferItGame: Referring to objects in photographs of natural scenes

S Kazemzadeh, V Ordonez, M Matten… - Proceedings of the 2014 …, 2014 - aclanthology.org
In this paper we introduce a new game to crowd-source natural language referring
expressions. By designing a two player game, we can both collect and verify referring …