Leveraging Synthetic Speech for CIF-Based Customized Keyword Spotting

S Liu, A Zhang, K Huang, L Xie
National Conference on Man-Machine Speech Communication, 2023, Springer
Abstract
Customized keyword spotting aims to detect user-defined keywords in continuous speech, providing flexibility and personalization. Previous research has mainly relied on similarity calculations between keyword text and acoustic features. However, because of the gap between the two modalities, it is challenging to obtain alignment information and model their correlation. In this paper, we propose a novel method to address these issues. First, we introduce a text-to-speech (TTS) module to generate the audio of keywords, addressing the cross-modal challenge of text-based customized keyword spotting. Second, we employ the Continuous Integrate-and-Fire (CIF) mechanism for boundary prediction to obtain token-level acoustic representations of keywords, thereby solving the keyword-speech alignment problem. Experimental results on the Aishell-1 dataset demonstrate the effectiveness of the proposed method: it significantly outperforms both the baseline method and the Dynamic Sequence Partitioning (DSP) method in keyword spotting accuracy. Compared with the DSP method, our model achieves a relative improvement of 72.7% in wake-up rate when the false accept rate is fixed at 0.02. It also represents a 64% improvement over the baseline model.
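As a rough illustration of how a CIF-style mechanism can turn frame-level features into token-level representations, the following is a minimal sketch of the general integrate-and-fire idea, not the authors' implementation. The function name `cif`, its parameters, and the firing threshold of 1.0 are assumptions made for illustration. In a real model the per-frame weights would be predicted by the network; here they are supplied directly, and each weight is assumed to lie in [0, 1] so that at most one token boundary fires per frame.

```python
import numpy as np

def cif(frames, alphas, threshold=1.0):
    """Integrate frame features until the accumulated weight reaches
    `threshold`, then fire one token-level embedding (hypothetical sketch).

    frames: array of shape (T, D), frame-level acoustic features
    alphas: array of shape (T,), per-frame weights assumed to be in [0, 1]
    returns: array of shape (N, D), one row per fired token
    """
    frames = np.asarray(frames, dtype=float)
    tokens = []
    acc_weight = 0.0                          # weight accumulated so far
    acc_state = np.zeros(frames.shape[1])     # weighted sum of frames so far
    for h, a in zip(frames, alphas):
        if acc_weight + a < threshold:
            # No boundary yet: keep integrating this frame.
            acc_weight += a
            acc_state = acc_state + a * h
        else:
            # Boundary located inside this frame: use just enough of its
            # weight to complete the current token, then fire.
            used = threshold - acc_weight
            tokens.append(acc_state + used * h)
            # The remaining weight of this frame starts the next token.
            acc_weight = a - used
            acc_state = acc_weight * h
    return np.array(tokens).reshape(-1, frames.shape[1])

# Example: four 2-dimensional frames whose weights sum to 2, giving two tokens.
frames = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 4.0], [0.0, 8.0]])
alphas = np.array([0.5, 0.5, 0.25, 0.75])
token_embeddings = cif(frames, alphas)
```

Each fired row is a weighted combination of the frames between two predicted boundaries, which is what allows a token-level comparison between a keyword and the incoming speech. Any leftover weight below the threshold at the end of the sequence is discarded in this sketch.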