Customized keyword spotting aims to detect user-defined keywords in continuous speech, providing flexibility and personalization. Previous research has mainly relied on similarity calculations between keyword text and acoustic features. However, due to the gap between the two modalities, it is challenging to obtain alignment information and to model their correlation. In this paper, we propose a novel method to address these issues. First, we introduce a text-to-speech (TTS) module to generate the audio of keywords, effectively addressing the cross-modal challenge of text-based customized keyword spotting. Furthermore, we employ the Continuous Integrate-and-Fire (CIF) mechanism for boundary prediction to obtain token-level acoustic representations of keywords, thereby solving the keyword-speech alignment problem. Experimental results on the Aishell-1 dataset demonstrate the effectiveness of the proposed method: it significantly outperforms both the baseline method and the Dynamic Sequence Partitioning (DSP) method in keyword spotting accuracy. With the false accept rate fixed at 0.02, our model achieves a 72.7% relative improvement in wake-up rate over the DSP method and a 64% improvement over the baseline model.
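
To make the boundary-prediction step concrete, the following is a minimal, illustrative sketch of the standard CIF firing rule (not the authors' implementation): per-frame weights are accumulated until they cross a threshold, at which point a token-level vector is emitted and marks a boundary. The function name `cif`, the threshold value, and the toy inputs are assumptions for illustration only.

```python
import numpy as np

def cif(frames: np.ndarray, alphas: np.ndarray, threshold: float = 1.0):
    """Sketch of Continuous Integrate-and-Fire.

    frames: (T, D) frame-level acoustic features.
    alphas: (T,) non-negative weights predicted for each frame.
    Returns a list of token-level vectors, one per fired boundary.
    """
    tokens = []
    accum = 0.0                              # weight accumulated since last firing
    integrated = np.zeros(frames.shape[1])   # weighted feature sum since last firing
    for h_t, a_t in zip(frames, alphas):
        if accum + a_t < threshold:
            # keep integrating the current frame
            accum += a_t
            integrated += a_t * h_t
        else:
            # the threshold is crossed: part of the weight completes the
            # current token (a boundary fires), the remainder starts the next
            a_used = threshold - accum
            tokens.append(integrated + a_used * h_t)
            accum = a_t - a_used
            integrated = accum * h_t
    return tokens

# Toy usage: 6 frames of 4-dim features whose weights fire two boundaries.
feats = np.random.randn(6, 4)
weights = np.array([0.4, 0.4, 0.3, 0.5, 0.3, 0.3])
print(len(cif(feats, weights)))  # -> 2 token-level vectors
```

In the proposed setting, such token-level vectors would serve as the acoustic representations of keyword tokens that are compared against the TTS-derived keyword audio, which is what ties boundary prediction to the alignment problem described above.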