Assisting persons with speech impediments using Recurrent Neural Networks

Bibliographic details
Published in: Gerontechnology 2018-04, Vol. 17 (s), p. 105
Authors: Mounir, R., Alqasemi, R., Dubey, R.
Format: Article
Language: English
Abstract
Purpose: This work focuses on enabling individuals with speech impairments to use speech-to-text software to recognize and dictate their speech. Automatic Speech Recognition (ASR) is a challenging problem because of the wide range of speech variability, including differences in accent, pronunciation, speed, and volume. Training an end-to-end speech recognition model on impaired speech is especially difficult: sufficiently large datasets are lacking, and no single speech-disorder pattern generalizes across all users with speech impediments. This work highlights the deep learning techniques used to achieve ASR and how they can be modified to recognize and dictate speech from individuals with speech impediments.

Method: The project is split into three consecutive stages: ASR to phonetic transcription, edit distance, and a language model. The ASR stage is the most challenging because of the complexity of the neural network architecture and the preprocessing involved. We apply Mel-Frequency Cepstral Coefficients (MFCCs) to each audio file, which yields 13 coefficients per frame. The labels (the text matching the audio) are converted to phonemes using the CMU ARPAbet phonetic dictionary. The network is trained with the MFCC coefficients as inputs and phoneme IDs as outputs. The architecture is a Bidirectional Recurrent Deep Neural Network (BRDNN): two LSTM cells (one per direction) with 100 hidden units in each direction, stacked with two more such layers to make the network three layers deep. Two fully connected layers with 128 hidden units each are attached to the output of the recurrent network. This architecture resulted in a 38.5% Label Error Rate.
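
The abstract names its tools only at a high level; below is a minimal Python sketch of the preprocessing stage it describes, assuming librosa for MFCC extraction and NLTK's copy of the CMU pronouncing dictionary (both are illustrative substitutions, not necessarily the authors' tooling):

```python
# Preprocessing sketch: 13 MFCCs per frame as inputs, ARPAbet phonemes as labels.
# Assumptions: librosa and NLTK stand in for whatever the authors actually used;
# 16 kHz sampling is a common ASR choice, not stated in the abstract.
import librosa
from nltk.corpus import cmudict  # requires nltk.download('cmudict') once

ARPABET = cmudict.dict()  # word -> list of ARPAbet pronunciations

def extract_mfcc(wav_path):
    """Return a (frames, 13) array of MFCCs for one audio file."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # 13 coefficients
    return mfcc.T  # shape: (frames, 13)

def text_to_phonemes(text):
    """Map a transcript to a flat phoneme sequence (first listed pronunciation)."""
    phones = []
    for word in text.lower().split():
        phones.extend(ARPABET[word][0])  # e.g. 'speech' -> ['S', 'P', 'IY1', 'CH']
    return phones  # out-of-vocabulary words would need handling in practice
```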
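
The network itself can be sketched as follows, here in PyTorch (an assumed framework; the abstract does not name one). The extra "blank" output class is an inference from the reported Label Error Rate, which is the usual metric for CTC-trained models, not something the abstract states:

```python
# BRDNN sketch: 3 stacked bidirectional LSTM layers with 100 hidden units per
# direction, followed by two 128-unit fully connected layers over phoneme IDs.
import torch.nn as nn

class BRDNN(nn.Module):
    def __init__(self, n_phonemes, n_mfcc=13):
        super().__init__()
        # num_layers=3 reproduces the "three layers deep" stacking; each layer
        # runs one LSTM per direction with 100 hidden units.
        self.rnn = nn.LSTM(input_size=n_mfcc, hidden_size=100, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * 100, 128), nn.ReLU(),  # 2 * 100: both directions
            nn.Linear(128, 128), nn.ReLU(),
        )
        # +1 output class for a CTC blank symbol (assumption, see lead-in).
        self.out = nn.Linear(128, n_phonemes + 1)

    def forward(self, mfcc_frames):
        # mfcc_frames: (batch, frames, 13) -> per-frame phoneme logits
        hidden, _ = self.rnn(mfcc_frames)
        return self.out(self.fc(hidden))
```

If CTC training is indeed what produced the 38.5% Label Error Rate, the logits would be passed through a log-softmax into nn.CTCLoss, and the decoded phoneme sequence would then feed the edit-distance and language-model stages.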
ISSN: 1569-1101, 1569-111X
DOI: 10.4017/gt.2018.17.s.102.00