Distillation Strategies for Discriminative Speech Recognition Rescoring
Authors:
Format: Article
Language: English
Online access: Order full text
Abstract: Second-pass rescoring is employed in most state-of-the-art speech recognition systems. Recently, BERT-based models have gained popularity for re-ranking the n-best hypotheses by exploiting knowledge from masked language model pre-training. Furthermore, fine-tuning with a discriminative loss such as minimum word error rate (MWER) has been shown to perform better than likelihood-based losses. Streaming applications with low-latency requirements impose significant constraints on model size, thereby limiting word error rate (WER) gains. In this paper, we propose effective strategies for distilling from large models discriminatively trained with the MWER objective. We experiment on LibriSpeech and a production-scale internal voice-assistant dataset. Our results demonstrate relative WER improvements of up to 7% over student models trained with MWER. We also show that the proposed distillation can reduce the WER gap between the student and the teacher by 62% up to 100%.
DOI: 10.48550/arxiv.2306.09452
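
The record only carries the abstract, so the following is a minimal, illustrative sketch of how an MWER-style rescoring loss can be combined with a distillation term toward an MWER-trained teacher's scores over the same n-best list. The function names, tensor shapes, and the `alpha`/`tau` weighting are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F


def mwer_loss(scores: torch.Tensor, word_errors: torch.Tensor) -> torch.Tensor:
    """Expected (relative) number of word errors over an n-best list.

    scores:      (batch, n_best) second-pass scores per hypothesis
    word_errors: (batch, n_best) word errors of each hypothesis vs. the reference
    """
    probs = F.softmax(scores, dim=-1)                  # distribution over the n-best list
    baseline = word_errors.mean(dim=-1, keepdim=True)  # mean errors in the list as baseline
    return (probs * (word_errors - baseline)).sum(dim=-1).mean()


def distilled_mwer_loss(student_scores, teacher_scores, word_errors,
                        alpha=0.5, tau=1.0):
    """Student MWER loss plus a KL term matching the teacher's hypothesis
    distribution (illustrative interpolation, not the paper's exact recipe)."""
    kl = F.kl_div(
        F.log_softmax(student_scores / tau, dim=-1),
        F.softmax(teacher_scores / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return alpha * mwer_loss(student_scores, word_errors) + (1.0 - alpha) * kl


# Example with a 4-way n-best list for 2 utterances (random placeholder values).
student = torch.randn(2, 4)
teacher = torch.randn(2, 4)
errors = torch.tensor([[0., 2., 3., 1.], [1., 0., 4., 2.]])
loss = distilled_mwer_loss(student, teacher, errors)
```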