Maximum-a-Posteriori-Based Decoding for End-to-End Acoustic Models
This paper presents a novel decoding framework for acoustic models (AMs) based on end-to-end neural networks (e.g., connectionist temporal classification). The end-to-end training of AMs has recently demonstrated high accuracy and efficiency in automatic speech recognition (ASR). When using the trai...
Gespeichert in:
Veröffentlicht in: | IEEE/ACM transactions on audio, speech, and language processing speech, and language processing, 2017-05, Vol.25 (5), p.1023-1034 |
---|---|
Hauptverfasser: | , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | This paper presents a novel decoding framework for acoustic models (AMs) based on end-to-end neural networks (e.g., connectionist temporal classification). The end-to-end training of AMs has recently demonstrated high accuracy and efficiency in automatic speech recognition (ASR). When using the trained AM in decoding, although a language model (LM) is implicitly involved in such an end-to-end AM, it is still essential to integrate an external LM trained with a large text corpus to achieve the best results. While there is no theoretical justification, most of the studies suggest using a naive interpolation of the end-to-end AM score and the external LM score, empirically. In this paper, we propose a more theoretically sound decoding framework derived from a maximization of the posterior probability of a word sequence given an observation. As a consequence of the theory, the subword LM is newly introduced to seamlessly integrate the external LM score with the end-to-end AM score. Our proposed method can be achieved by a small modification of the conventional weighted finite-state transducer-based implementation, without having to heavily increase the graph size. We tested the proposed decoding framework on ASR experiments with the Corpus of the Wall Street Journal and the Corpus of Spontaneous Japanese. The results showed that the proposed framework achieved significant and consistent improvements over the conventional interpolation-based decoding framework. |
---|---|
ISSN: | 2329-9290 2329-9304 |
DOI: | 10.1109/TASLP.2017.2678162 |