Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk
Image captioning refers to automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …

Dual-level collaborative transformer for image captioning

Y Luo, J Ji, X Sun, L Cao, Y Wu, F Huang… - Proceedings of the …, 2021 - ojs.aaai.org
Descriptive region features extracted by object detection networks have played an important
role in the recent advancements of image captioning. However, they are still criticized for the …

Look before you speak: Visually contextualized utterances

PH Seo, A Nagrani, C Schmid - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
While most conversational AI systems focus on textual dialogue only, conditioning
utterances on visual context (when it's available) can lead to more realistic conversations …

Deep image captioning: A review of methods, trends and future challenges

L Xu, Q Tang, J Lv, B Zheng, X Zeng, W Li - Neurocomputing, 2023 - Elsevier
Image captioning, also called report generation in the medical field, aims to describe the visual
content of images in human language, which requires modeling the semantic relationship …

Cross-modal text and visual generation: A systematic review. Part 1: Image to text

M Żelaszczyk, J Mańdziuk - Information Fusion, 2023 - Elsevier
We review the existing literature on generating text from visual data under the cross-modal
generation umbrella, which affords us to compare and contrast various approaches taking …

Affection: Learning affective explanations for real-world visual data

P Achlioptas, M Ovsjanikov… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we explore the space of emotional reactions induced by real-world images. For
this, we first introduce a large-scale dataset that contains both categorical emotional …

From methods to datasets: A survey on Image-Caption Generators

L Agarwal, B Verma - Multimedia Tools and Applications, 2024 - Springer
Image-Caption Generator is a popular Artificial Intelligence research tool that works
with image comprehension and language definition. Creating well-structured sentences …

A survey of image captioning based on deep learning

石义乐, 杨文忠, 杜慧祥, 王丽花, 王婷, 理珊珊 - Acta Electronica Sinica (电子学报), 2021 - ejournal.org.cn
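Image captioning aims to extract image features, feed them into a language generation model, and output
the corresponding description, thereby addressing intelligent image understanding, a problem at the intersection of natural language processing and computer vision in artificial intelligence. Here, for 2015 …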

Diagram perception networks for textbook question answering via joint optimization

J Ma, J Liu, Q Chai, P Wang, J Tao - International Journal of Computer …, 2024 - Springer
Textbook question answering requires a system to answer questions with or without
diagrams accurately, given multimodal contexts that include rich paragraphs and diagrams …