A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

Comparing deep learning models for low-light natural scene image enhancement and their impact on object detection and classification: Overview, empirical …

R Al Sobbahi, J Tekli - Signal Processing: Image Communication, 2022 - Elsevier
Low-light image (LLI) enhancement is an important image processing task that aims at
improving the illumination of images taken under low-light conditions. Recently, a …

GRES: Generalized referring expression segmentation

C Liu, H Ding, X Jiang - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
Abstract Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES datasets and …

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Y Zeng, H Zhang, J Zheng, J Xia, G Wei… - Proceedings of the …, 2024 - aclanthology.org
Recent advancements in GPT-4V have displayed remarkable multi-modal capabilities in
processing image inputs and following open-ended instructions. Despite these …

Attention guided low-light image enhancement with a large scale low-light simulation dataset

F Lv, Y Li, F Lu - International Journal of Computer Vision, 2021 - Springer
Low-light image enhancement is challenging in that it needs to consider not only brightness
recovery but also complex issues like color distortion and noise, which usually hide in the …

Multimodal compact bilinear pooling for visual question answering and visual grounding

A Fukui, DH Park, D Yang, A Rohrbach… - arXiv preprint arXiv …, 2016 - arxiv.org
Modeling textual or visual information with vector representations trained from large
language or visual datasets has been successfully explored in recent years. However, tasks …

Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

BA Plummer, L Wang, CM Cervantes… - Proceedings of the …, 2015 - openaccess.thecvf.com
The Flickr30k dataset has become a standard benchmark for sentence-based image
description. This paper presents Flickr30k Entities, which augments the 158k captions from …

Multi30K: Multilingual English-German image descriptions

D Elliott, S Frank, K Sima'an, L Specia - arXiv preprint arXiv:1605.00459, 2016 - arxiv.org
We introduce the Multi30K dataset to stimulate multilingual multimodal research. Recent
advances in image description have been demonstrated on English-language datasets …

Deep captioning with multimodal recurrent neural networks (m-RNN)

J Mao, W Xu, Y Yang, J Wang, Z Huang… - arXiv preprint arXiv …, 2014 - arxiv.org
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for
generating novel image captions. It directly models the probability distribution of generating …

ReferItGame: Referring to objects in photographs of natural scenes

S Kazemzadeh, V Ordonez, M Matten… - Proceedings of the 2014 …, 2014 - aclanthology.org
In this paper we introduce a new game to crowd-source natural language referring
expressions. By designing a two player game, we can both collect and verify referring …