Language agnostic speaker embedding for cross-lingual personalized speech generation

Y Zhou, X Tian, H Li - IEEE/ACM Transactions on Audio …, 2021 - ieeexplore.ieee.org
Cross-lingual personalized speech generation seeks to synthesize a target speaker's voice
from only a few training samples that are in a different language. One popular technique is to …

Optimizing voice conversion network with cycle consistency loss of speaker identity

H Du, X Tian, L Xie, H Li - 2021 IEEE Spoken language …, 2021 - ieeexplore.ieee.org
We propose a novel training scheme to optimize voice conversion network with a speaker
identity loss function. The training scheme not only minimizes frame-level spectral loss, but …

Self-supervised training of speaker encoder with multi-modal diverse positive pairs

R Tao, KA Lee, RK Das… - IEEE/ACM Transactions …, 2023 - ieeexplore.ieee.org
We study a novel neural speaker encoder and its training strategies for speaker recognition
without using any identity labels. The speaker encoder is trained to extract a fixed …

A modularized neural network with language-specific output layers for cross-lingual voice conversion

Y Zhou, X Tian, E Yılmaz, RK Das… - 2019 IEEE Automatic …, 2019 - ieeexplore.ieee.org
This paper presents a cross-lingual voice conversion framework that adopts a modularized
neural network. The modularized neural network has a common input structure that is …

Cross-Lingual Voice Conversion with a Cycle Consistency Loss on Linguistic Representation

Y Zhou, X Tian, Z Wu, H Li - Interspeech, 2021 - isca-archive.org
Cross-Lingual Voice Conversion (XVC) aims to modify a source speaker identity
towards a target while preserving the source linguistic content. This paper introduces a cycle …

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

H Du, L Xie - arXiv preprint arXiv:2106.10406, 2021 - arxiv.org
One-shot voice conversion has received significant attention since only one utterance from
source speaker and target speaker respectively is required. Moreover, source speaker and …

Self-Supervised Training of Speaker Encoder with Multi-Modal Diverse Positive Pairs

R Tao, KA Lee, RK Das, V Hautamäki, H Li - arXiv preprint arXiv …, 2022 - arxiv.org
We study a novel neural architecture and its training strategies of speaker encoder for
speaker recognition without using any identity labels. The speaker encoder is trained to …

Transfer learning from monolingual ASR to transcription-free cross-lingual voice conversion

CJ Chang - arXiv preprint arXiv:2009.14668, 2020 - arxiv.org
Cross-lingual voice conversion (VC) is a task that aims to synthesize target voices with the
same content while source and target speakers speak in different languages. Its challenge …

MaskMel-Prosody-CycleGAN-VC: High-Quality Cross-Lingual Voice Conversion

S Yan, S Chen, Y Xu, D Ke - International Conference on Artificial …, 2023 - Springer
Voice conversion aims to change the timbre of the source speaker to that of the target
speaker without changing the speech content. The cross-lingual voice conversion requires …

Audio-Visual Active Speaker Detection and Recognition

T Ruijie - 2023 - search.proquest.com
In our daily life, humans can recognize the person based on their facial and voice
characteristics. Research in biology has proved that speech and face modalities can provide …