Authors
Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D Plumbley, Wenwu Wang
Publication date
2022/5/23
Conference
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Pages
8882-8886
Publisher
IEEE
Description
Audio captioning aims at generating natural language descriptions for audio clips automatically. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum likelihood estimation (MLE), which tends to make captions generic, simple and deterministic. As different people may describe an audio clip from different aspects using distinct words and grammars, we argue that an audio captioning system should have the ability to generate diverse captions for a fixed audio clip and across similar audio clips. To address this problem, we propose an adversarial training framework for audio captioning based on a conditional generative adversarial network (C-GAN), which aims at improving the naturalness and diversity of generated captions. Unlike processing data of continuous values in a classical GAN, a sentence is composed of discrete …