Maximum-a-Posteriori-Based Decoding for End-to-End Acoustic Models

Detailed description

Bibliographic details
Published in:	IEEE/ACM transactions on audio, speech, and language processing, 2017-05, Vol.25 (5), p.1023-1034
Main authors:	Kanda, Naoyuki; Lu, Xugang; Kawai, Hisashi
Format:	Article
Language:	English
Subjects:	Acoustic modeling; Acoustics; Automatic speech recognition; connectionist temporal classification; Corpus linguistics; Data models; Decoding; end-to-end neural network; Hidden Markov models; Interpolation; Japanese language; Language modeling; Neural networks; Speech; Speech recognition; Spontaneous speech; Training; Voice recognition

Description
Abstract:	This paper presents a novel decoding framework for acoustic models (AMs) based on end-to-end neural networks (e.g., connectionist temporal classification). The end-to-end training of AMs has recently demonstrated high accuracy and efficiency in automatic speech recognition (ASR). When using the trained AM in decoding, although a language model (LM) is implicitly involved in such an end-to-end AM, it is still essential to integrate an external LM trained with a large text corpus to achieve the best results. While there is no theoretical justification, most of the studies suggest using a naive interpolation of the end-to-end AM score and the external LM score, empirically. In this paper, we propose a more theoretically sound decoding framework derived from a maximization of the posterior probability of a word sequence given an observation. As a consequence of the theory, the subword LM is newly introduced to seamlessly integrate the external LM score with the end-to-end AM score. Our proposed method can be achieved by a small modification of the conventional weighted finite-state transducer-based implementation, without having to heavily increase the graph size. We tested the proposed decoding framework on ASR experiments with the Corpus of the Wall Street Journal and the Corpus of Spontaneous Japanese. The results showed that the proposed framework achieved significant and consistent improvements over the conventional interpolation-based decoding framework.
ISSN:	2329-9290 2329-9304
DOI:	10.1109/TASLP.2017.2678162