Grapheme-to-Phoneme Conversion with Convolutional Neural Networks

Bibliographic Details
Published in:	Applied Sciences, 2019, Vol. 9 (6), p. 1143
Main authors:	Yolchuyeva, Sevinj; Németh, Géza; Gyires-Tóth, Bálint
Format:	Article
Language:	English
Subjects:	1D convolution; Accuracy; Acoustics; Bi-LSTM; Conversion; encoder-decoder; Encoders-Decoders; Grapheme phoneme correspondence; grapheme-to-phoneme (G2P); Language; Linearity; LSTM; Machine translation; Multilayers; Natural language processing; Neural networks; Phonemes; Phonetics; residual architecture; Speech; Speech recognition; Speech synthesis; Synesthesia; Text-to-speech; Voice recognition
Online access:	Full text

Description
Summary:	Grapheme-to-phoneme (G2P) conversion is the process of generating the pronunciation of words from their written form. It plays an essential role in natural language processing, text-to-speech synthesis, and automatic speech recognition systems. In this paper, we investigate convolutional neural networks (CNNs) for G2P conversion and propose a novel CNN-based sequence-to-sequence (seq2seq) architecture. Our approach includes an end-to-end CNN G2P model with residual connections and a model that uses a CNN (with and without residual connections) as the encoder and a Bi-LSTM as the decoder. We compare our approach with state-of-the-art methods, including Encoder-Decoder LSTM and Encoder-Decoder Bi-LSTM. Training and inference times, as well as phoneme and word error rates, were evaluated on the public CMUDict dataset for US English, and the best-performing CNN-based architecture was also evaluated on the NetTalk dataset. Our method approaches the accuracy of previous state-of-the-art results in terms of phoneme error rate.
ISSN:	2076-3417
DOI:	10.3390/app9061143
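
The summary reports phoneme and word error rates without defining them. As a minimal sketch, assuming the commonly used definitions (phoneme error rate as total edit distance divided by the total number of reference phonemes, and word error rate as the fraction of words whose predicted pronunciation differs from the reference), the two metrics could be computed as follows. The function names and the toy data are hypothetical and do not come from the article, and the sketch assumes a single reference pronunciation per word, which may differ from the evaluation protocol the authors used.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] is the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edits divided by total reference phonemes (assumed definition)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phonemes = sum(len(r) for r in refs)
    return total_edits / total_phonemes


def word_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Fraction of words with any phoneme error (assumed definition)."""
    wrong = sum(1 for r, h in zip(refs, hyps) if list(r) != list(h))
    return wrong / len(refs)


# Toy data, invented for illustration only (ARPAbet-style symbols).
refs = [["K", "AE", "T"], ["D", "AO", "G"]]
hyps = [["K", "AE", "T"], ["D", "AA", "G"]]

per = phoneme_error_rate(refs, hyps)
wer = word_error_rate(refs, hyps)
print(f"PER = {per:.3f}, WER = {wer:.3f}")
```

In this toy case one of six reference phonemes is substituted, so a single phoneme error makes half of the words count as wrong. This illustrates why word error rates are typically much higher than phoneme error rates for the same system.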