How to train your speaker embeddings extractor

ML McLaren, D Castan, MK Nandwana, L Ferrer… - 2018 - repository.ubn.ru.nl
Abstract
With the recent introduction of speaker embeddings for text-independent speaker recognition, many fundamental questions need to be addressed in order to fast-track the development of this new era of technology. Of particular interest is the ability of the speaker embeddings network to leverage artificially degraded data far more effectively than prior technologies, even when evaluated on naturally degraded data. In this study, we explore some of the fundamental requirements for building a good speaker embeddings extractor. We analyze the impact of voice activity detection, the types of degradation, the amount of degraded data, and the number of speakers required for a good network. These aspects are analyzed over a large set of 11 conditions from 7 evaluation datasets. We lay out a set of recommendations for training the network based on the observed trends. Applying these recommendations to enhance the default recipe provided in the Kaldi toolkit yields a significant gain of 13-21% on the Speakers in the Wild and NIST SRE’16 datasets.