Once a popular theme of futuristic science fiction or far-fetched technology forecasts, digital home assistants with a spoken language interface have become a ubiquitous commodity …
End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition model into one neural network with a much smaller …
Language is more than words: it includes the prosodic features and patterns that we use, subconsciously, to frame meanings and achieve our goals in our interaction with others …
There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications …
In this paper, we propose" personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on …
AH Michaely, X Zhang, G Simko… - 2017 IEEE Automatic …, 2017 - ieeexplore.ieee.org
We present a novel keyword spotting (KWS) system that uses contextual automatic speech recognition (ASR). For voice-activated devices, it is common that a KWS system is run on the …
Voice activity detection (VAD) is the task of predicting which parts of an utterance contains speech versus background noise. It is an important first step to determine which samples to …
WR Huang, S Chang, D Rybach… - arXiv preprint arXiv …, 2022 - arxiv.org
Improving the performance of end-to-end ASR models on long utterances ranging from minutes to hours in length is an ongoing challenge in speech recognition. A common …
As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems …