Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

From general to specific: Informative scene graph generation via balance adjustment

Y Guo, L Gao, X Wang, Y Hu, X Xu… - Proceedings of the …, 2021 - openaccess.thecvf.com
The scene graph generation (SGG) task aims to detect visual relationship triplets, i.e., subject,
predicate, object, in an image, providing a structural vision layout for scene understanding …

Visual news: Benchmark and challenges in news image captioning

F Liu, Y Wang, T Wang, V Ordonez - arXiv preprint arXiv:2010.03743, 2020 - arxiv.org
We propose Visual News Captioner, an entity-aware model for the task of news image
captioning. We also introduce Visual News, a large-scale benchmark consisting of more …

Transform and tell: Entity-aware news image captioning

A Tran, A Mathews, L Xie - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
We propose an end-to-end model which generates captions for images embedded in news
articles. News images present two key challenges: they rely on real-world knowledge …

Improving image captioning with better use of captions

Z Shi, X Zhou, X Qiu, X Zhu - arXiv preprint arXiv:2006.11807, 2020 - arxiv.org
Image captioning is a multimodal problem that has drawn extensive attention in both the
natural language processing and computer vision community. In this paper, we present a …

Underspecification in scene description-to-depiction tasks

B Hutchinson, J Baldridge, V Prabhakaran - arXiv preprint arXiv …, 2022 - arxiv.org
Questions regarding implicitness, ambiguity and underspecification are crucial for
understanding the task validity and ethical concerns of multimodal image+text systems, yet …

Boosting entity-aware image captioning with multi-modal knowledge graph

W Zhao, X Wu - IEEE Transactions on Multimedia, 2023 - ieeexplore.ieee.org
Entity-aware image captioning aims to describe named entities and events related to the
image by utilizing the background knowledge in the associated article. This task remains …

A unified framework for slot based response generation in a multimodal dialogue system

M Firdaus, A Madasu, A Ekbal - Multimedia Tools and Applications, 2024 - Springer
Natural Language Understanding (NLU) and Natural Language Generation (NLG)
are the two critical components of every conversational system that handles the task of …

Reinforcing an image caption generator using off-line human feedback

PH Seo, P Sharma, T Levinboim, B Han… - Proceedings of the AAAI …, 2020 - aaai.org
Human ratings are currently the most accurate way to assess the quality of an image
captioning model, yet most often the only used outcome of an expensive human rating …

Quality estimation for image captions based on large-scale human evaluations

T Levinboim, AV Thapliyal, P Sharma… - arXiv preprint arXiv …, 2019 - arxiv.org
Automatic image captioning has improved significantly over the last few years, but the
problem is far from being solved, with state-of-the-art models still often producing low quality …