Whispered Speech Recognition Using Deep Denoising Autoencoder and Inverse Filtering
Published in: | IEEE/ACM transactions on audio, speech, and language processing, 2017-12, Vol.25 (12), p.2313-2322 |
---|---|
Main authors: | , |
Format: | Article |
Language: | English |
Keywords: | |
Abstract: | Due to the profound differences between the acoustic characteristics of neutral and whispered speech, the performance of traditional automatic speech recognition (ASR) systems trained on neutral speech degrades significantly when they are applied to whisper. To analyze this mismatched train/test situation in depth and to develop an efficient approach to whisper recognition, this study first analyzes the acoustic characteristics of whispered speech, addresses the problems of whispered speech recognition in mismatched conditions, and then proposes new robust cepstral features and a preprocessing approach based on a deep denoising autoencoder (DDAE) that enhance whisper recognition. The experimental results confirm that Teager-energy-based cepstral features, especially TECCs, are more robust and better whisper descriptors than traditional Mel-frequency cepstral coefficients (MFCCs). Further detailed analyses of cepstral distances, distributions of cepstral coefficients, and confusion matrices, and experiments with inverse filtering, prove that voicing in speech stimuli is the main cause of word misclassification in mismatched train/test scenarios. The new framework, based on the DDAE and TECC features, significantly improves whisper recognition accuracy and outperforms the traditional MFCC and GMM-HMM (Gaussian mixture density-hidden Markov model) baseline, yielding an absolute 31% improvement in whisper recognition accuracy. The achieved word recognition rate in the neutral/whisper scenario is 92.81%. |
---|---|
ISSN: | 2329-9290 2329-9304 |
DOI: | 10.1109/TASLP.2017.2738559 |