Gammatonegram representation for end-to-end dysarthric speech processing tasks: speech recognition, speaker identification, and intelligibility assessment
Published in: | Iran Journal of Computer Science (Online) 2024-06, Vol.7 (2), p.311-324 |
---|---|
Main authors: | , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Abstract: | Dysarthria is a disability that disturbs the human speech system and reduces the quality and intelligibility of a person’s speech. As a result, conventional speech processing systems do not work correctly on such impaired speech. Dysarthria is usually associated with physical disabilities, so a system that can carry out tasks in a smart home from voice commands would be a significant achievement. In this work, we introduce the Gammatonegram as an effective way to represent audio files with discriminative details that can be used as input to convolutional neural networks. In other words, we convert each speech file into an image and propose an image recognition system to classify speech in different scenarios. The proposed convolutional neural networks are built by transfer learning from the pre-trained AlexNet. This research evaluates the proposed system on speech recognition, speaker identification, and intelligibility assessment tasks. On the UA speech dataset, the proposed speech recognition system achieved a 91.29% word recognition rate in speaker-dependent mode, the speaker identification system achieved an 87.74% recognition rate in text-dependent mode, and the intelligibility assessment system achieved a 96.47% recognition rate in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. It is arranged in cascade with the two-class intelligibility assessment system, whose output activates one of the speech recognition networks. This architecture achieves a word recognition rate of 92.3%. |
---|---|
ISSN: | 2520-8438 2520-8446 |
DOI: | 10.1007/s42044-024-00175-y |
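
The cascade arrangement described in the abstract, in which a two-class intelligibility assessment selects which speech recognition network handles an utterance, can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the article: the class labels `"high"` and `"low"`, the function `cascade_recognize`, and the toy stand-in models, which replace the AlexNet-based networks with trivial placeholders so that only the routing logic is shown.

```python
from typing import Callable, Dict, List, Tuple

# Stand-in for a gammatonegram image: a 2-D list of pixel intensities.
Image = List[List[float]]


def cascade_recognize(
    image: Image,
    assess: Callable[[Image], str],
    recognizers: Dict[str, Callable[[Image], str]],
) -> Tuple[str, str]:
    """Route an image to the recognizer selected by its intelligibility class."""
    label = assess(image)  # stage 1: two-class intelligibility assessment
    if label not in recognizers:
        raise KeyError(f"no recognizer registered for class {label!r}")
    # stage 2: only the network activated by stage 1 is run
    return label, recognizers[label](image)


# Placeholder models (assumptions, NOT the article's networks): a threshold on
# the mean pixel value stands in for the intelligibility classifier, and
# constant functions stand in for the per-class speech recognition networks.
def toy_assess(image: Image) -> str:
    values = [v for row in image for v in row]
    return "high" if sum(values) / len(values) >= 0.5 else "low"


toy_recognizers = {
    "high": lambda image: "word_from_high_network",
    "low": lambda image: "word_from_low_network",
}

bright = [[0.9, 0.8], [0.7, 0.6]]  # mean 0.75, routed to the "high" network
dark = [[0.1, 0.2], [0.0, 0.3]]    # mean 0.15, routed to the "low" network
print(cascade_recognize(bright, toy_assess, toy_recognizers))
print(cascade_recognize(dark, toy_assess, toy_recognizers))
```

In a real system, `assess` and each entry of `recognizers` would wrap trained classifiers operating on Gammatonegram images; the routing function itself would stay the same.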