Practical Conformer: Optimizing size, speed and flops of Conformer for on-Device and cloud ASR

R Botros, A Gulati, TN Sainath, K Choromanski… - arXiv preprint arXiv …, 2023 - arxiv.org
arXiv preprint arXiv:2304.00171, 2023
Conformer models maintain a large number of internal states, the vast majority of which are associated with self-attention layers. With limited memory bandwidth, reading these from memory at each inference step can slow down inference. In this paper, we design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs. We explore various ideas to improve the execution speed, including replacing lower conformer blocks with convolution-only blocks, strategically downsizing the architecture, and utilizing an RNNAttention-Performer. Our optimized conformer can be readily incorporated into a cascaded-encoder setting, allowing a second-pass decoder to operate on its output and improve the accuracy whenever more resources are available. Altogether, we find that these optimizations can reduce latency by a factor of 6.8x, and come at a reasonable trade-off in quality. With the cascaded second pass, we show that the recognition accuracy is completely recoverable. Thus, our proposed encoder can double as a strong standalone encoder on device and as the first part of a high-performance ASR pipeline.
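To make the memory argument in the abstract concrete, here is a rough back-of-envelope sketch in Python of how much per-step state a streaming encoder might carry, and how that changes when the lower blocks become convolution-only. Everything in it is an assumption for illustration and is not taken from the paper: the function names, the idea that each attention layer caches keys and values for a fixed left context, and the sizes used (12 blocks, model dimension 512, left context 64 frames, kernel size 15, 4 convolution-only blocks).

```python
# Illustrative estimate only; all sizes below are hypothetical, not the paper's.

def attention_state_floats(model_dim, left_context):
    # Assumption: streaming self-attention caches keys and values
    # for `left_context` past frames, each of width `model_dim`.
    return 2 * left_context * model_dim


def conv_state_floats(model_dim, kernel_size):
    # A causal depthwise convolution needs the previous
    # kernel_size - 1 frames as its streaming state.
    return (kernel_size - 1) * model_dim


def encoder_state_floats(num_blocks, num_conv_only, model_dim,
                         left_context, kernel_size):
    # Full blocks carry attention state plus convolution state;
    # convolution-only blocks carry convolution state alone.
    attn = attention_state_floats(model_dim, left_context)
    conv = conv_state_floats(model_dim, kernel_size)
    full_blocks = num_blocks - num_conv_only
    return full_blocks * (attn + conv) + num_conv_only * conv


# Hypothetical configuration: 12 blocks, with 0 or 4 made convolution-only.
baseline = encoder_state_floats(12, 0, 512, 64, 15)
reduced = encoder_state_floats(12, 4, 512, 64, 15)
attention_share = 12 * attention_state_floats(512, 64) / baseline

print(f"baseline state: {baseline} floats")
print(f"with 4 conv-only blocks: {reduced} floats")
print(f"attention share of baseline state: {attention_share:.0%}")
```

Under these assumed sizes the attention caches account for most of the state that has to be read at each inference step, which is the effect the abstract describes; the actual ratio depends on the real context length, kernel size, and model dimension.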