Selective listening by synchronizing speech with lips

Z Pan, R Tao, C Xu, H Li - IEEE/ACM Transactions on Audio …, 2022 - ieeexplore.ieee.org
A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-
talker speech mixture when given a cue that represents the target speaker, such as a pre …

Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability

R Liu, B Sisman, H Li - arXiv preprint arXiv:2104.01408, 2021 - arxiv.org
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
However, the generated voice is often not perceptually identifiable by its intended emotion …

Language agnostic speaker embedding for cross-lingual personalized speech generation

Y Zhou, X Tian, H Li - IEEE/ACM Transactions on Audio …, 2021 - ieeexplore.ieee.org
Cross-lingual personalized speech generation seeks to synthesize a target speaker's voice
from only a few training samples that are in a different language. One popular technique is to …

Speech separation with pretrained frontend to minimize domain mismatch

W Wang, Z Pan, X Li, S Wang… - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Speech separation seeks to separate individual speech signals from a speech mixture.
Typically, most separation models are trained on synthetic data due to the unavailability of …

Optimization of cross-lingual voice conversion with linguistics losses to reduce foreign accents

Y Zhou, Z Wu, X Tian, H Li - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
Cross-lingual voice conversion (XVC) transforms the speaker identity of a source speaker to
that of a target speaker who speaks a different language. Due to the intrinsic differences …

Spike-event-driven deep spiking neural network with temporal encoding

Z Zhang, Q Liu - IEEE Signal Processing Letters, 2021 - ieeexplore.ieee.org
Feature extraction plays an important role before pattern recognition takes place. The
existing artificial neural networks (ANNs), however, fail to learn and represent temporal …

Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions

S Ghosh, S Sarkar, S Ghosh, F Zalkow, ND Jana - Applied Intelligence, 2024 - Springer
Audio-visual speech synthesis (AVSS) has garnered attention in recent years for its utility in
the realm of audio-visual learning. AVSS transforms one speaker's speech into another's …

Cross-lingual voice conversion with a cycle consistency loss on linguistic representation

Y Zhou, X Tian, Z Wu, H Li - Interspeech, 2021 - isca-archive.org
Cross-Lingual Voice Conversion (XVC) aims to modify a source speaker's identity
towards that of a target while preserving the source linguistic content. This paper introduces a cycle …

A multi-task and transfer learning based approach for MOS prediction

X Tian, K Fu, S Gao, Y Gu, K Wang, W Li, Z Ma… - 2022 - isca-archive.org
Automatic speech quality assessment aims to train a model capable of automatically
measuring the performance of synthesis systems. This is a challenging task, especially …

MaskMel-Prosody-CycleGAN-VC: High-Quality Cross-Lingual Voice Conversion

S Yan, S Chen, Y Xu, D Ke - International Conference on Artificial …, 2023 - Springer
Voice conversion aims to change the timbre of the source speaker to that of the target
speaker without changing the speech content. Cross-lingual voice conversion requires …