Partial Rewriting for Multi-Stage ASR
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | For many streaming automatic speech recognition tasks, it is important to
provide timely intermediate streaming results while refining a high-quality
final result. This can be done using a multi-stage architecture, where a small
left-context-only model creates streaming results and a larger left- and
right-context model produces a final result at the end. While this
significantly improves the quality of the final results without compromising
the streaming emission latency of the system, the streaming results do not
benefit from the quality improvements. Here, we propose a text manipulation
algorithm that merges the streaming outputs of both models. We improve the
quality of streaming results by around 10%, without altering the final results.
Our approach introduces no additional latency and reduces flickering. It is
also lightweight, does not require retraining the model, and can be applied
to a wide variety of multi-stage architectures. |
---|---|
DOI: | 10.48550/arxiv.2312.09463 |
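
The summary does not spell out the merging algorithm, so the following is only an illustrative sketch of what merging the streaming outputs of two models could look like, not the authors' method. It assumes both hypotheses are available as word lists, that the larger model's partial output lags behind the smaller model's, and that a plain word-level alignment is good enough to find where the two overlap. The function name `merge_partials` and the fallback rules are hypothetical.

```python
from difflib import SequenceMatcher


def merge_partials(large_words, small_words):
    """Hypothetical merge of two partial hypotheses (lists of words).

    Assumption: the large model is more accurate but lags behind, so its
    words are kept verbatim and only the part of the small model's output
    that extends past them is appended.
    """
    if not large_words:
        # The large model has produced nothing yet: show the small model as-is.
        return list(small_words)

    # Word-level alignment between the two hypotheses.
    matcher = SequenceMatcher(a=large_words, b=small_words, autojunk=False)
    blocks = [m for m in matcher.get_matching_blocks() if m.size > 0]

    if not blocks:
        # No shared words: fall back to aligning by word count.
        return list(large_words) + list(small_words[len(large_words):])

    last = blocks[-1]
    # Large-model words after the last aligned block are assumed to replace
    # the same number of small-model words after that block.
    unmatched_large = len(large_words) - (last.a + last.size)
    tail_start = last.b + last.size + unmatched_large
    return list(large_words) + list(small_words[tail_start:])


if __name__ == "__main__":
    small = "i want to fly too boston tomorrow morning".split()
    large = "i want to fly to boston".split()
    print(" ".join(merge_partials(large, small)))
```

In this sketch the displayed partial result takes the larger model's wording wherever it exists and only shows the smaller model's words for the most recent audio. The final result is left untouched because it still comes from the larger model alone.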