Authors
Sung-Feng Huang, Chyi-Jiunn Lin, Da-Rong Liu, Yi-Chen Chen, Hung-yi Lee
Publication date
2022/4/13
Journal
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume
30
Pages
1558-1571
Publisher
IEEE
Description
Personalizing a speech synthesis system is a highly desired application, where the system can generate speech in the user’s voice from only a few enrolled recordings. Recent work takes two main approaches to building such a system: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with a few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making them hard to apply on devices. On the other hand, speaker encoding methods encode enrollment utterances into a speaker embedding. The trained TTS model can then synthesize the user’s speech conditioned on the corresponding speaker embedding. Nevertheless, the speaker encoder suffers from a generalization gap between seen and unseen speakers. In this paper, we propose applying a meta-learning …
Scholar articles
SF Huang, CJ Lin, DR Liu, YC Chen, H Lee - IEEE/ACM Transactions on Audio, Speech, and …, 2022