Two-Pass End-to-End ASR Model Compression
Format: Article
Language: English
Online access: Order full text
Abstract: Speech recognition on smart devices is challenging owing to their limited memory footprint; hence, small ASR models are desirable. With the use of popular transducer-based models, it has become practical to deploy streaming speech recognition models on small devices [1]. Recently, the two-pass model [2], which combines RNN-T and LAS modules, has shown exceptional performance for streaming on-device speech recognition.

In this work, we propose a simple and effective approach to reduce the size of the two-pass model for memory-constrained devices. We apply a popular knowledge distillation approach in three stages using the teacher-student training technique. In the first stage, we use a trained RNN-T model as the teacher and perform knowledge distillation to train the student RNN-T model. In the second stage, we reuse the shared encoder and train a LAS rescorer for the student model using the trained RNN-T+LAS teacher model. Finally, we perform deep finetuning of the student model with the shared RNN-T encoder, RNN-T decoder, and LAS rescorer. Our experimental results on the standard LibriSpeech dataset show that our system achieves a high compression rate of 55% without significant degradation in WER compared to the two-pass teacher model.
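The three-stage recipe summarized above rests on a standard teacher-student distillation objective. Below is a minimal, hypothetical sketch of how such an objective is commonly formed, assuming a temperature-scaled soft-label KL term blended with the student's own task loss (e.g. the RNN-T loss in stage one); the function and parameter names (`distillation_loss`, `temperature`, `alpha`) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a generic teacher-student distillation loss (assumption, not
# the paper's code): mix the student's supervised task loss with a
# temperature-scaled KL term toward the frozen teacher's distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, task_loss,
                      temperature=2.0, alpha=0.5):
    """student_logits, teacher_logits: (..., vocab) unnormalized scores.
    task_loss: the student's usual supervised loss (e.g. RNN-T loss)."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    # Scale by T^2 so the soft-label gradient magnitude stays comparable
    # across temperature settings.
    return alpha * task_loss + (1.0 - alpha) * (temperature ** 2) * kd

# Toy usage with random tensors standing in for per-frame output scores.
student = torch.randn(8, 50, 1024)          # student logits
teacher = torch.randn(8, 50, 1024).detach() # teacher logits (frozen)
task = torch.tensor(1.3)                    # placeholder task loss value
loss = distillation_loss(student, teacher, task)
```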
DOI: 10.48550/arxiv.2201.02741