Building synthetic voices for under-resourced languages: The feasibility of using audiobook data

F de Wet, W Van der Walt, N Dlamini, A Govender
2017 Pattern Recognition Association of South Africa and Robotics …, 2017 - ieeexplore.ieee.org
Creating synthetic voices that are natural and intelligible is a daunting challenge for well-resourced languages. The challenge is much bigger for languages in which the speech and text resources required for voice development are not available. Previous studies have suggested audiobooks as an alternative source of speech data. This paper reports on a comparison between voices derived from audiobook data and voices based on professional voice artist data. Two sets of voices were evaluated: male voices built using very small amounts of both data types (around 3 hours, representing a severely resource-constrained scenario) and female voices trained on almost 10 hours of audiobook and professional speech data. The results of subjective listening tests indicate that, while the majority of the listeners preferred the voice artists' voices over the audiobook voices, the difference in naturalness was not perceived to be substantial. Results also showed that the artists' voices outperform the audiobook voices in terms of intelligibility, especially if only a limited amount of training data is available. Although additional training data improves the intelligibility of audiobook voices, the results seem to indicate that a smaller quantity of professional data yields a better voice than large volumes of audiobook data, especially old audiobook data.