Accurate synthesis of dysarthric speech for ASR data augmentation


Bibliographic details
Published in: Speech Communication, 2024-10, Vol. 164, p. 103112, Article 103112
Authors: Soleymanpour, Mohammad; Johnson, Michael T.; Soleymanpour, Rahim; Berry, Jeffrey
Format: Article
Language: English
Online access: Full text
Description
Abstract:

Highlights:
• Modified a neural multi-talker TTS by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech at varying severity levels.
• Provides data augmentation for machine learning tasks such as automatic speech recognition.
• Evaluates the effectiveness of the approach for dysarthria-specific speech recognition on the TORGO dataset, with results reported for two experiments: the first uses augmented speech across 3 severity levels with pause insertion, and the second uses augmented speech across a larger set of variables including severity, pause, pitch, energy, and duration.
• A relative improvement of 12.2% in word error rate (WER) demonstrates that using dysarthric synthetic speech to increase the amount of dysarthric-patterned training speech has the potential for significant impact on dysarthric ASR systems.
• Two subjective evaluations of the synthesized dysarthric speech are provided: an evaluation of dysarthric-ness, showing that the perceived level of dysarthria tracks the target synthesized severity, and an evaluation of speaker similarity, showing higher similarity ratings when the synthesis target speaker and the actual speaker are the same individual.
• A demonstration web page with audio results of the synthesis is available at https://mohammadelc.github.io/SpeechGroupUKY/.

Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility resulting from slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. Differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model, so that dysarthric speech can be synthesized at varying severity levels. To evaluate the effectiveness of the synthesized training data for ASR, dysarthria-specific speech recognition was used. Results show a relative improvement of 12.2% in word error rate (WER).
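The abstract describes two additions to a neural multi-talker TTS: a dysarthria severity level coefficient and a pause insertion model. The sketch below is not the authors' implementation; it only illustrates one plausible way such conditioning could look in PyTorch, where a scalar severity value is projected and added to the encoder states alongside a speaker embedding, and a simple probabilistic pause-insertion step stands in for the paper's learned pause model. All names (SeverityConditionedEncoder, insert_pauses, PAUSE_TOKEN, the GRU encoder, the 0-1 severity scale) are assumptions for illustration.

```python
# Minimal sketch (not the paper's code): conditioning a multi-talker TTS
# encoder on a scalar dysarthria severity coefficient, plus a toy
# pause-insertion step whose pause rate grows with severity.
import random
import torch
import torch.nn as nn

PAUSE_TOKEN = 0          # assumed id reserved for a silence/pause symbol
VOCAB_SIZE = 80          # assumed phoneme inventory size

class SeverityConditionedEncoder(nn.Module):
    """Phoneme encoder whose states are shifted by speaker and severity embeddings."""

    def __init__(self, d_model: int = 256, n_speakers: int = 16):
        super().__init__()
        self.phoneme_emb = nn.Embedding(VOCAB_SIZE, d_model)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        # A single severity coefficient (0 = healthy ... 1 = severe) is
        # projected into the model dimension and added to every frame.
        self.severity_proj = nn.Linear(1, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, phonemes, speaker_id, severity):
        # phonemes: (B, T) int64, speaker_id: (B,) int64, severity: (B, 1) float
        x = self.phoneme_emb(phonemes)
        x = x + self.speaker_emb(speaker_id).unsqueeze(1)
        x = x + self.severity_proj(severity).unsqueeze(1)
        hidden, _ = self.encoder(x)
        return hidden  # downstream variance adaptor / decoder not shown


def insert_pauses(phonemes, severity, base_rate=0.05):
    """Insert pause tokens between phonemes with a probability that grows
    with the target severity level (a stand-in for a learned pause model)."""
    out = []
    for p in phonemes:
        out.append(p)
        if random.random() < base_rate * (1.0 + 4.0 * severity):
            out.append(PAUSE_TOKEN)
    return out


if __name__ == "__main__":
    enc = SeverityConditionedEncoder()
    seq = insert_pauses([12, 27, 5, 41, 9], severity=0.8)
    h = enc(torch.tensor([seq]),
            speaker_id=torch.tensor([3]),
            severity=torch.tensor([[0.8]]))
    print(h.shape)  # (1, len(seq), 256)
```

Adding the severity projection to every encoder frame mirrors how speaker embeddings are commonly injected in multi-talker TTS, keeping the conditioning global while leaving the decoder untouched; the actual severity conditioning and pause model in the paper may differ.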
ISSN: 0167-6393
DOI: 10.1016/j.specom.2024.103112