EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed
Main Authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Non-autoregressive (NAR) automatic speech recognition (ASR) models predict
tokens independently and simultaneously, which makes inference fast.
However, their accuracy still lags behind that of
autoregressive (AR) models. In this paper, we propose a single-step NAR ASR
architecture with high accuracy and inference speed, called EffectiveASR. It
uses an Index Mapping Vector (IMV) based alignment generator to generate
alignments during training, and an alignment predictor to learn the alignments
for inference. It can be trained end-to-end (E2E) with cross-entropy loss
combined with alignment loss. The proposed EffectiveASR achieves competitive
results on the AISHELL-1 and AISHELL-2 Mandarin benchmarks compared to the
leading models. Specifically, it achieves character error rates (CER) of
4.26%/4.62% on the AISHELL-1 dev/test sets, outperforming the AR
Conformer while running inference about 30x faster. |
---|---|
DOI: | 10.48550/arxiv.2406.08835 |
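
The summary reports results as character error rate (CER). As a reading aid, here is a minimal sketch of how CER is conventionally computed: the Levenshtein edit distance between the reference and hypothesis character sequences, divided by the reference length. The function names `edit_distance` and `cer` are illustrative assumptions, not taken from the paper, and the authors' actual scoring tooling is not described in this record.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("今天天气很好", "今天天汽很好")` counts one substituted character out of six. Corpus-level CER is usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.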