Fairseq S2T: Fast speech-to-text modeling with fairseq

C Wang, Y Tang, X Ma, A Wu, S Popuri… - arXiv preprint arXiv …, 2020 - arxiv.org
We introduce fairseq S2T, a fairseq extension for speech-to-text (S2T) modeling tasks such
as end-to-end speech recognition and speech-to-text translation. It follows fairseq's careful …

STEMM: Self-learning with speech-text manifold mixup for speech translation

Q Fang, R Ye, L Li, Y Feng, M Wang - arXiv preprint arXiv:2203.10426, 2022 - arxiv.org
How to learn a better speech representation for end-to-end speech-to-text translation (ST)
with limited labeled data? Existing techniques often attempt to transfer powerful machine …

Learning shared semantic space for speech-to-text translation

C Han, M Wang, H Ji, L Li - arXiv preprint arXiv:2105.03095, 2021 - arxiv.org
Having numerous potential applications and great impact, end-to-end speech translation
(ST) has long been treated as an independent task, failing to fully draw strength from the …

Speech translation and the end-to-end promise: Taking stock of where we are

M Sperber, M Paulik - arXiv preprint arXiv:2004.06358, 2020 - arxiv.org
Over its three decade history, speech translation has experienced several shifts in its
primary research themes; moving from loosely coupled cascades of speech recognition and …

Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining

WC Huang, T Hayashi, YC Wu, H Kameoka… - arXiv preprint arXiv …, 2019 - arxiv.org
We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based
on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models …

SimulMT to SimulST: Adapting simultaneous text translation to end-to-end simultaneous speech translation

X Ma, J Pino, P Koehn - arXiv preprint arXiv:2011.02048, 2020 - arxiv.org
Simultaneous text translation and end-to-end speech translation have recently made great
progress but little work has combined these tasks together. We investigate how to adapt …

A study of transformer-based end-to-end speech recognition system for Kazakh language

M Orken, O Dina, A Keylan, T Tolganay, O Mohamed - Scientific reports, 2022 - nature.com
Today, the Transformer model, which allows parallelization and also has its own internal
attention, has been widely used in the field of speech recognition. The great advantage of …

CMOT: Cross-modal mixup via optimal transport for speech translation

Y Zhou, Q Fang, Y Feng - arXiv preprint arXiv:2305.14635, 2023 - arxiv.org
End-to-end speech translation (ST) is the task of translating speech signals in the source
language into text in the target language. As a cross-modal task, end-to-end ST is difficult to …

Listen, understand and translate: Triple supervision decouples end-to-end speech-to-text translation

Q Dong, R Ye, M Wang, H Zhou, S Xu, B Xu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
An end-to-end speech-to-text translation (ST) takes audio in a source language and outputs
the text in a target language. Existing methods are limited by the amount of parallel corpus …

A comparative study on end-to-end speech to text translation

P Bahar, T Bieschke, H Ney - 2019 IEEE Automatic Speech …, 2019 - ieeexplore.ieee.org
Recent advances in deep learning show that end-to-end speech to text translation model is
a promising approach to direct the speech translation field. In this work, we provide an …