An overview of voice conversion and its challenges: From statistical modeling to deep learning

B Sisman, J Yamagishi, S King… - IEEE/ACM Transactions …, 2020 - ieeexplore.ieee.org
Speaker identity is one of the important characteristics of human speech. In voice
conversion, we change the speaker identity from one to another, while keeping the linguistic …

Speech synthesis evaluation—state-of-the-art assessment and suggestion for a novel research program

P Wagner, J Beskow, S Betz, J Edlund… - Proceedings of the 10th …, 2019 - core.ac.uk
Speech synthesis applications have become an ubiquity, in navigation systems, digital
assistants or as screen or audio book readers. Despite their impact on the acceptability of …

The VoiceMOS Challenge 2022

WC Huang, E Cooper, Y Tsao, HM Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
We present the first edition of the VoiceMOS Challenge, a scientific event that aims to
promote the study of automatic prediction of the mean opinion score (MOS) of synthetic …

MOSNet: Deep learning based objective assessment for voice conversion

CC Lo, SW Fu, WC Huang, X Wang… - arXiv preprint arXiv …, 2019 - arxiv.org
Existing objective evaluation metrics for voice conversion (VC) are not always correlated
with human perception. Therefore, training VC models with such criteria may not effectively …

UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022

T Saeki, D Xin, W Nakata, T Koriyama… - arXiv preprint arXiv …, 2022 - arxiv.org
We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to
VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples …

MBNet: MOS prediction for synthesized speech with mean-bias network

Y Leng, X Tan, S Zhao, F Soong… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Mean opinion score (MOS) is a popular subjective metric to assess the quality of
synthesized speech, and usually involves multiple human judges to evaluate each speech …

CDPAM: Contrastive learning for perceptual audio similarity

P Manocha, Z Jin, R Zhang… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Many speech processing methods based on deep learning require an automatic and
differentiable audio metric for the loss function. The DPAM approach of Manocha et al.[1] …

Deep learning based assessment of synthetic speech naturalness

G Mittag, S Möller - arXiv preprint arXiv:2104.11673, 2021 - arxiv.org
In this paper, we present a new objective prediction model for synthetic speech naturalness.
It can be used to evaluate Text-To-Speech or Voice Conversion systems and works …

Utilizing self-supervised representations for MOS prediction

WC Tseng, C Huang, WT Kao, YY Lin, H Lee - arXiv preprint arXiv …, 2021 - arxiv.org
Speech quality assessment has been a critical issue in speech processing for decades.
Existing automatic evaluations usually require clean references or parallel ground truth data …

NORESQA: A framework for speech quality assessment using non-matching references

P Manocha, B Xu, A Kumar - Advances in neural …, 2021 - proceedings.neurips.cc
The perceptual task of speech quality assessment (SQA) is a challenging task for machines
to do. Objective SQA methods that rely on the availability of the corresponding clean …