Improving End-to-End Contextual Speech Recognition with Fine-Grained Contextual Knowledge Selection
Format: | Article |
Language: | English |
Abstract: | Nowadays, most methods in end-to-end contextual speech recognition bias the recognition process towards contextual knowledge. Since all-neural contextual biasing methods rely on phrase-level contextual modeling and attention-based relevance modeling, they may confuse similar context-specific phrases, which hurts predictions at the token level. In this work, we focus on mitigating such confusion with fine-grained contextual knowledge selection (FineCoS). In FineCoS, we introduce fine-grained knowledge to reduce the uncertainty of token predictions. Specifically, we first apply phrase selection to narrow the range of phrase candidates, and then conduct token attention on the tokens of the selected phrase candidates. Moreover, we re-normalize the attention weights of the most relevant phrases during inference to obtain more focused phrase-level contextual representations, and inject position information to better discriminate between phrases and tokens. On LibriSpeech and an in-house 160,000-hour dataset, we evaluate the proposed methods on top of a controllable all-neural biasing method, collaborative decoding (ColDec). The proposed methods yield up to 6.1% relative word-error-rate reduction on LibriSpeech and 16.4% relative character-error-rate reduction on the in-house dataset over ColDec. |
DOI: | 10.48550/arxiv.2201.12806 |
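
To make the two-stage selection described in the abstract concrete, here is a minimal PyTorch sketch of phrase selection followed by token attention. The tensor shapes, the dot-product scoring, and the `top_k` parameter are illustrative assumptions, not the paper's exact FineCoS architecture; the same vectors are reused as attention values for brevity.

```python
# Two-stage fine-grained contextual knowledge selection (sketch):
# 1) score contextual phrases and keep the top-k candidates,
# 2) run token-level attention only over tokens of the kept phrases.
import torch
import torch.nn.functional as F


def fine_grained_selection(
    query,       # (B, D)       decoder state used as the attention query
    phrase_reps, # (B, P, D)    one vector per contextual phrase
    token_reps,  # (B, P, T, D) one vector per token inside each phrase
    token_mask,  # (B, P, T)    bool, True for real tokens, False for padding
    top_k=8,     # hypothetical number of phrase candidates to keep
):
    # Stage 1: phrase-level relevance scores via dot product.
    phrase_scores = torch.einsum("bd,bpd->bp", query, phrase_reps)

    # Narrow the candidate set to the k most relevant phrases.
    k = min(top_k, phrase_scores.size(1))
    _, topk_idx = phrase_scores.topk(k, dim=-1)  # (B, k)

    # Gather token representations and masks of the selected phrases.
    T, D = token_reps.size(2), token_reps.size(3)
    idx = topk_idx[..., None, None].expand(-1, -1, T, D)
    sel_reps = token_reps.gather(1, idx)                              # (B, k, T, D)
    sel_mask = token_mask.gather(1, topk_idx[..., None].expand(-1, -1, T))

    # Stage 2: token-level attention restricted to the selected phrases.
    B = sel_reps.size(0)
    flat_reps = sel_reps.reshape(B, k * T, D)
    flat_mask = sel_mask.reshape(B, k * T)
    token_scores = torch.einsum("bd,bnd->bn", query, flat_reps)
    token_scores = token_scores.masked_fill(~flat_mask, float("-inf"))
    token_attn = F.softmax(token_scores, dim=-1)                      # (B, k*T)

    # Fine-grained contextual representation for the current decoding step.
    context = torch.einsum("bn,bnd->bd", token_attn, flat_reps)
    return context, topk_idx
```

Restricting the token-level softmax to the top-k phrases is what keeps the attention fine-grained: tokens from irrelevant phrases can no longer leak probability mass into the context vector, which is the confusion the abstract says FineCoS mitigates.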
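The abstract also mentions re-normalizing the attention weights of the most relevant phrases at inference time. Below is a minimal sketch of one way to do that, assuming a softmax-then-truncate top-k rule; the paper's actual selection criterion may differ.

```python
# Inference-time re-normalization (sketch): keep only the attention weights
# of the most relevant phrases and rescale them to sum to one, yielding a
# more focused phrase-level contextual representation.
import torch
import torch.nn.functional as F


def renormalize_phrase_attention(phrase_scores, phrase_values, top_k=3):
    """phrase_scores: (B, P) raw relevance logits; phrase_values: (B, P, D)."""
    attn = F.softmax(phrase_scores, dim=-1)  # (B, P)

    # Zero out all but the top-k phrase weights, then re-normalize.
    topk_vals, topk_idx = attn.topk(min(top_k, attn.size(1)), dim=-1)
    focused = torch.zeros_like(attn).scatter(1, topk_idx, topk_vals)
    focused = focused / focused.sum(dim=-1, keepdim=True)

    # Focused phrase-level contextual representation.
    return torch.einsum("bp,bpd->bd", focused, phrase_values)
```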