Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

Scaling speech technology to 1,000+ languages

V Pratap, A Tjandra, B Shi, P Tomasello, A Babu… - Journal of Machine …, 2024 - jmlr.org
Expanding the language coverage of speech technology has the potential to improve
access to information for many more people. However, current speech technology is …

Robust speech recognition via large-scale weak supervision

A Radford, JW Kim, T Xu, G Brockman… - International …, 2023 - proceedings.mlr.press
We study the capabilities of speech processing systems trained simply to predict large
amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual …

XLS-R: Self-supervised cross-lingual speech representation learning at scale

A Babu, C Wang, A Tjandra, K Lakhotia, Q Xu… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper presents XLS-R, a large-scale model for cross-lingual speech representation
learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a …

Emerging trends: A gentle introduction to fine-tuning

KW Church, Z Chen, Y Ma - Natural Language Engineering, 2021 - cambridge.org
The previous Emerging Trends article (Church et al., 2021. Natural Language
Engineering 27(5), 631–645.) introduced deep nets to poets. Poets is an imperfect …

Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding

Y Peng, S Dalmia, I Lane… - … Conference on Machine …, 2022 - proceedings.mlr.press
Conformer has proven to be effective in many speech processing tasks. It combines the
benefits of extracting local dependencies using convolutions and global dependencies …

Layer-wise analysis of a self-supervised speech representation model

A Pasad, JC Chou, K Livescu - 2021 IEEE Automatic Speech …, 2021 - ieeexplore.ieee.org
Recently proposed self-supervised learning approaches have been successful for
pre-training speech representation models. The utility of these learned representations has been …

Torchaudio: Building blocks for audio and speech processing

YY Yang, M Hira, Z Ni, A Astafurov… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
This document describes version 0.10 of TorchAudio: building blocks for machine learning
applications in the audio and speech processing domain. The objective of TorchAudio is to …

Transformer-based multimodal information fusion for facial expression analysis

W Zhang, F Qiu, S Wang, H Zeng… - Proceedings of the …, 2022 - openaccess.thecvf.com
Human affective behavior analysis has received much attention in human-computer
interaction (HCI). In this paper, we introduce our submission to the CVPR 2022 Competition …

Speech emotion recognition using self-supervised features

E Morais, R Hoory, W Zhu, I Gat… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Self-supervised pre-trained features have consistently delivered state-of-art results in the
field of natural language processing (NLP); however, their merits in the field of speech …