NExT-GPT: Any-to-Any Multimodal LLM

S Wu, H Fei, L Qu, W Ji, TS Chua - arXiv preprint arXiv:2309.05519, 2023 - arxiv.org
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides,
they mostly fall prey to the limitation of only input-side multimodal understanding, without the …

Toward More Human-Like AI Communication: A Review of Emergent Communication Research

N Brandizzi - IEEE Access, 2023 - ieeexplore.ieee.org
In the recent shift towards human-centric AI, the need for machines to accurately use natural
language has become increasingly important. While a common approach to achieve this is …

SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

G Sun, C Qin, J Wang, Z Chen, R Xu, Z Tao - European Conference on …, 2025 - Springer
Recent advances in vision-language models have shown notable generalization in broad
tasks through visual instruction tuning. However, bridging the gap between the pre-trained …

Towards Semantic Equivalence of Tokenization in Multimodal LLM

S Wu, H Fei, X Li, J Ji, H Zhang, TS Chua… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in
processing vision-language tasks. One of the crux of MLLMs lies in vision tokenization …

Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

D Bucciarelli, N Moratelli, M Cornia… - Proceedings of the …, 2024 - iris.unimore.it
The task of image captioning demands an algorithm to generate natural language
descriptions of visual inputs. Recent advancements have seen a convergence between …

Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

S Sarto, N Moratelli, M Cornia, L Baraldi… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant advancements in caption generation, existing evaluation metrics often fail
to capture the full quality or fine-grained details of captions. This is mainly due to their …

Overcoming the Pitfalls of Vision-Language Model for Image-Text Retrieval

F Zhang, S Qu, F Shi, C Xu - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
This work tackles the persistent challenge of image-text retrieval, a key problem at the
intersection of computer vision and natural language processing. Despite significant …

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

M Gaur, M Tapaswi - arXiv preprint arXiv:2409.03025, 2024 - arxiv.org
Image captioning systems are unable to generate fine-grained captions as they are trained
on data that is either noisy (alt-text) or generic (human annotations). This is further …

Fluent and Accurate Image Captioning with a Self-Trained Reward Model

N Moratelli, M Cornia, L Baraldi… - arXiv preprint arXiv …, 2024 - arxiv.org
Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has
been a classical strategy for promoting caption quality at the sequence level. This approach …

Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method

U Berger, G Stanovsky, O Abend… - arXiv preprint arXiv …, 2024 - arxiv.org
The task of image captioning has recently been gaining popularity, and with it the complex
task of evaluating the quality of image captioning models. In this work, we present the first …