A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Transformers in medical image analysis

K He, C Gan, Z Li, I Rekik, Z Yin, W Ji, Y Gao, Q Wang… - Intelligent …, 2023 - Elsevier
Transformers have dominated the field of natural language processing and have recently
made an impact in the area of computer vision. In the field of medical image analysis …

Neural codec language models are zero-shot text to speech synthesizers

C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called Vall-E) using discrete codes derived from …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

A survey of transformers

T Lin, Y Wang, X Liu, X Qiu - AI Open, 2022 - Elsevier
Transformers have achieved great success in many artificial intelligence fields, such as
natural language processing, computer vision, and audio processing. Therefore, it is natural …

Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech

J Kim, J Kong, J Son - International Conference on Machine …, 2021 - proceedings.mlr.press
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and
parallel sampling have been proposed, but their sample quality does not match that of two …

MOTR: End-to-end multiple-object tracking with transformer

F Zeng, B Dong, Y Zhang, T Wang, X Zhang… - European Conference on …, 2022 - Springer
Temporal modeling of objects is a key challenge in multiple-object tracking (MOT). Existing
methods track by associating detections through motion-based and appearance-based …

Grad-TTS: A diffusion probabilistic model for text-to-speech

V Popov, I Vovk, V Gogoryan… - International …, 2021 - proceedings.mlr.press
Recently, denoising diffusion probabilistic models and generative score matching have
shown high potential in modelling complex data distributions while stochastic calculus has …

NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …