Robust speech recognition via large-scale weak supervision

A Radford, JW Kim, T Xu, G Brockman… - International …, 2023 - proceedings.mlr.press
We study the capabilities of speech processing systems trained simply to predict large
amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual …

WavLM: Large-scale self-supervised pre-training for full stack speech processing

S Chen, C Wang, Z Chen, Y Wu, S Liu… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Self-supervised learning (SSL) has achieved great success in speech recognition, while other speech processing tasks remain comparatively underexplored. As speech signal …

BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition

Y Zhang, DS Park, W Han, J Qin… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …

Deep learning for audio signal processing

H Purwins, B Li, T Virtanen, J Schlüter… - IEEE Journal of …, 2019 - ieeexplore.ieee.org
Given the recent surge in developments of deep learning, this paper provides a review of the
state-of-the-art deep learning techniques for audio signal processing. Speech, music, and …

SpeechStew: Simply mix all available speech recognition data to train one large neural network

W Chan, D Park, C Lee, Y Zhang, Q Le… - arXiv preprint arXiv …, 2021 - arxiv.org
We present SpeechStew, a speech recognition model that is trained on a combination of
various publicly available speech recognition datasets: AMI, Broadcast News, Common …

SpecAugment on large scale datasets

DS Park, Y Zhang, CC Chiu, Y Chen… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Recently, SpecAugment, an augmentation scheme for automatic speech recognition that
acts directly on the spectrogram of input utterances, has been shown to be highly effective in …

Towards fast and accurate streaming end-to-end ASR

B Li, S Chang, TN Sainath, R Pang… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
End-to-end (E2E) models fold the acoustic, pronunciation and language models of a
conventional speech recognition model into one neural network with a much smaller …

Recognizing long-form speech using streaming end-to-end models

A Narayanan, R Prabhavalkar, CC Chiu… - 2019 IEEE automatic …, 2019 - ieeexplore.ieee.org
All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single
neural network to transduce audio to word sequences have been shown to achieve state-of …

FastEmit: Low-latency streaming ASR with sequence-level emission regularization

J Yu, CC Chiu, B Li, S Chang… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as
quickly and accurately as possible. However, emitting fast without degrading quality, as …

Dual-mode ASR: Unify and improve streaming ASR with full-context modeling

J Yu, W Han, A Gulati, CC Chiu, B Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as
quickly and accurately as possible, while full-context ASR waits for the completion of a full …