A comprehensive survey of hallucination in large language, image, video and audio foundation models

P Sahoo, P Meharia, A Ghosh, S Saha… - Findings of the …, 2024 - aclanthology.org
Rapid advances in foundation models (FMs) across language, image, audio, and video
domains have yielded remarkable capabilities in diverse tasks. However, the …

Retrieval-augmented generation for AI-generated content: A survey

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …
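
As a point of reference for the retrieval-augmented pipelines surveyed above, the following is a minimal, illustrative Python sketch of a generic RAG loop: retrieve supporting documents, then condition generation on them. The toy corpus, the bag-of-words embed/cosine scoring, and the placeholder generate function are assumptions for illustration, not the survey's formulation.

```python
# Minimal sketch of a generic retrieval-augmented generation (RAG) loop.
# The corpus, scoring, and the stand-in generate() are illustrative only.
from collections import Counter
import math

CORPUS = [
    "Retrieval-augmented generation grounds model outputs in retrieved documents.",
    "Foundation models can hallucinate facts that are not supported by evidence.",
    "Text-to-audio models synthesize sound from natural-language descriptions.",
]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a dense encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank corpus documents by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(CORPUS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for a generator: in practice the prompt below would be passed to an LLM."""
    prompt = "Context:\n" + "\n".join(f"- {c}" for c in context) + f"\nQuestion: {query}\nAnswer:"
    return prompt  # a real generator would produce an answer conditioned on this prompt

if __name__ == "__main__":
    question = "Why do foundation models hallucinate?"
    print(generate(question, retrieve(question)))
```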

A multifaceted vision of the Human-AI collaboration: a comprehensive review

M Puerta-Beldarrain, O Gómez-Carmona… - IEEE …, 2025 - ieeexplore.ieee.org
Human-AI collaboration has evolved into a complex, multidimensional paradigm shaped by
research in various domains. Key areas such as human-in-the-loop systems, Interactive …

Audiobox tta-rag: Improving zero-shot and few-shot text-to-audio with retrieval-augmented generation

M Yang, B Shi, M Le, WN Hsu, A Tjandra - arXiv preprint arXiv:2411.05141, 2024 - arxiv.org
Current leading Text-To-Audio (TTA) generation models suffer from degraded performance
in zero-shot and few-shot settings. It is often challenging to generate high-quality audio for …

Melody is all you need for music generation

S Wei, M Wei, H Wang, Y Zhao, G Kou - arXiv preprint arXiv:2409.20196, 2024 - arxiv.org
We present the Melody Guided Music Generation (MMGen) model, the first approach to use
melody to guide music generation, which, despite a relatively simple method and …

FlowSep: Language-Queried Sound Separation with Rectified Flow Matching

Y Yuan, X Liu, H Liu, MD Plumbley, W Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Language-queried audio source separation (LASS) focuses on separating sounds using
textual descriptions of the desired sources. Current methods mainly use discriminative …
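
For readers unfamiliar with the objective named in this title, below is a minimal PyTorch sketch of a single rectified flow matching training step on toy features. The small MLP, the feature dimensions, and the omission of any text-query conditioning are simplifying assumptions; this is a generic illustration of the technique, not FlowSep's model.

```python
# Minimal sketch of one rectified flow matching training step on toy 1-D features.
# The tiny MLP, data, and shapes are illustrative; this is not FlowSep's architecture.
import torch
import torch.nn as nn

dim = 16
velocity_net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

x1 = torch.randn(32, dim)          # a batch of target data samples (e.g. latent audio features)
x0 = torch.randn(32, dim)          # noise samples from a standard Gaussian prior
t = torch.rand(32, 1)              # interpolation times t ~ U(0, 1)

xt = (1 - t) * x0 + t * x1         # straight-line interpolation between noise and data
target_v = x1 - x0                 # rectified flow regresses the constant velocity of that line

pred_v = velocity_net(torch.cat([xt, t], dim=-1))
loss = ((pred_v - target_v) ** 2).mean()

opt.zero_grad()
loss.backward()
opt.step()
print(f"flow matching loss: {loss.item():.4f}")
```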

Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions

Y Yuan, D Jia, X Zhuang, Y Chen, Z Liu, Z Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative models have achieved significant success in audio generation tasks.
However, existing models struggle with complex and detailed prompts, leading to potential …

Rhythmic Foley: A framework for seamless audio-visual alignment in video-to-audio synthesis

Z Huang, D Luo, J Wang, H Liao, Z Li, Z Wu - arXiv preprint arXiv …, 2024 - arxiv.org
Our research introduces an innovative framework for video-to-audio synthesis that addresses
the problems of audio-video desynchronization and semantic loss in the generated audio. By …

Retrieval-Augmented Dialogue Knowledge Aggregation for expressive conversational speech synthesis

R Liu, Z Jia, F Bao, H Li - Information Fusion, 2025 - Elsevier
Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a
reference to synthesize expressive speech that aligns with the conversational style. Unlike …

BATON: aligning text-to-audio model using human preference feedback

H Liao, H Han, K Yang, T Du, R Yang, Q Xu… - Proceedings of the Thirty …, 2024 - ijcai.org
With the development of AI-Generated Content (AIGC), text-to-audio models are gaining
widespread attention. However, it is challenging for these models to generate audio aligned …
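
As an illustration of how human preference feedback of this kind is commonly turned into a training signal, here is a minimal PyTorch sketch of a pairwise (Bradley-Terry style) preference loss over a toy reward model. The reward network, feature dimensions, and synthetic chosen/rejected embeddings are assumptions; this is a generic sketch, not necessarily BATON's objective.

```python
# Minimal sketch of a pairwise (Bradley-Terry) preference loss over a toy reward model.
# The reward network, feature dimensions, and synthetic data are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim = 32
reward_model = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-ins for embeddings of (text prompt, generated audio) pairs:
# 'chosen' was preferred by the annotator over 'rejected' for the same prompt.
chosen = torch.randn(16, feat_dim)
rejected = torch.randn(16, feat_dim)

r_chosen = reward_model(chosen)
r_rejected = reward_model(rejected)

# Encourage a positive margin r(chosen) - r(rejected): negative log-sigmoid of the difference.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

opt.zero_grad()
loss.backward()
opt.step()
print(f"preference loss: {loss.item():.4f}")
```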