A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

STEMM: Self-learning with speech-text manifold mixup for speech translation

Q Fang, R Ye, L Li, Y Feng, M Wang - arXiv preprint arXiv:2203.10426, 2022 - arxiv.org
How to learn a better speech representation for end-to-end speech-to-text translation (ST)
with limited labeled data? Existing techniques often attempt to transfer powerful machine …

Recent advances in direct speech-to-text translation

C Xu, R Ye, Q Dong, C Zhao, T Ko, M Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, speech-to-text translation has attracted more and more attention and many studies
have emerged rapidly. In this paper, we present a comprehensive survey on direct speech …

Cross-modal contrastive learning for speech translation

R Ye, M Wang, L Li - arXiv preprint arXiv:2205.02444, 2022 - arxiv.org
How can we learn unified representations for spoken utterances and their written text?
Learning similar representations for semantically similar speech and text is important for …

Speechut: Bridging speech and text with hidden-unit for encoder-decoder based speech-text pre-training

Z Zhang, L Zhou, J Ao, S Liu, L Dai, J Li… - arXiv preprint arXiv …, 2022 - arxiv.org
The rapid development of single-modal pre-training has prompted researchers to pay more
attention to cross-modal pre-training methods. In this paper, we propose a unified-modal …

M3ST: Mix at Three Levels for Speech Translation

X Cheng, Q Dong, F Yue, T Ko… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's
well known that data augmentation is an efficient method to improve performance for many …

Revisiting end-to-end speech-to-text translation from scratch

B Zhang, B Haddow… - … conference on machine …, 2022 - proceedings.mlr.press
Abstract End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its
encoder and/or decoder using source transcripts via speech recognition or text translation …

CMOT: Cross-modal mixup via optimal transport for speech translation

Y Zhou, Q Fang, Y Feng - arXiv preprint arXiv:2305.14635, 2023 - arxiv.org
End-to-end speech translation (ST) is the task of translating speech signals in the source
language into text in the target language. As a cross-modal task, end-to-end ST is difficult to …

Pre-training for speech translation: Ctc meets optimal transport

PH Le, H Gong, C Wang, J Pino… - International …, 2023 - proceedings.mlr.press
The gap between speech and text modalities is a major challenge in speech-to-text
translation (ST). Different methods have been proposed to reduce this gap, but most of them …

Comsl: A composite speech-language model for end-to-end speech-to-text translation

C Le, Y Qian, L Zhou, S Liu, Y Qian… - Advances in Neural …, 2024 - proceedings.neurips.cc
Joint speech-language training is challenging due to the large demand for training data and
GPU consumption, as well as the modality gap between speech and language. We present …