- Academic resource search

Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier

Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

Cited by 88 · Related articles · All 5 versions

[PDF] port.ac.uk

Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk

Image captioning refers to automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …

Cited by 31 · Related articles · All 6 versions

[PDF] aaai.org

Dual-level collaborative transformer for image captioning

Y Luo, J Ji, X Sun, L Cao, Y Wu, F Huang… - Proceedings of the …, 2021 - ojs.aaai.org

Descriptive region features extracted by object detection networks have played an important
role in the recent advancements of image captioning. However, they are still criticized for the …

Cited by 265 · Related articles · All 6 versions

[PDF] thecvf.com

Look before you speak: Visually contextualized utterances

PH Seo, A Nagrani, C Schmid - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

While most conversational AI systems focus on textual dialogue only, conditioning
utterances on visual context (when it's available) can lead to more realistic conversations …

Cited by 78 · Related articles · All 6 versions

Deep image captioning: A review of methods, trends and future challenges

L Xu, Q Tang, J Lv, B Zheng, X Zeng, W Li - Neurocomputing, 2023 - Elsevier

Image captioning, also called report generation in medical field, aims to describe visual
content of images in human language, which requires to model semantic relationship …

Cited by 17 · Related articles · All 2 versions

Cross-modal text and visual generation: A systematic review. Part 1: Image to text

M Żelaszczyk, J Mańdziuk - Information Fusion, 2023 - Elsevier

We review the existing literature on generating text from visual data under the cross-modal
generation umbrella, which affords us to compare and contrast various approaches taking …

Cited by 9 · Related articles · All 4 versions

[PDF] thecvf.com

Affection: Learning affective explanations for real-world visual data

P Achlioptas, M Ovsjanikov… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this work, we explore the space of emotional reactions induced by real-world images. For
this, we first introduce a large-scale dataset that contains both categorical emotional …

Cited by 5 · Related articles · All 10 versions

From methods to datasets: A survey on Image-Caption Generators

L Agarwal, B Verma - Multimedia Tools and Applications, 2024 - Springer

Image-Caption Generator is a popular Artificial Intelligence research tool that works
with image comprehension and language definition. Creating well-structured sentences …

Cited by 3 · Related articles

[PDF] ejournal.org.cn

[PDF] A survey of image captioning based on deep learning (基于深度学习的图像描述综述)

石义乐, 杨文忠, 杜慧祥, 王丽花, 王婷, 理珊珊 - Acta Electronica Sinica (电子学报), 2021 - ejournal.org.cn

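Image captioning aims to extract features from an image, feed them into a language generation model, and output
the corresponding description, addressing intelligent image understanding, a problem at the intersection of natural language processing and computer vision. Regarding 2015 …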

Cited by 8 · Related articles · All 3 versions

Diagram perception networks for textbook question answering via joint optimization

J Ma, J Liu, Q Chai, P Wang, J Tao - International Journal of Computer …, 2024 - Springer

Textbook question answering requires a system to answer questions with or without
diagrams accurately, given multimodal contexts that include rich paragraphs and diagrams …

Cited by 2 · Related articles · All 2 versions
