Performance Estimation of Speech Recognition System Under Noise Conditions Using Objective Quality Measures and Artificial Voice
It is essential to ensure quality of service (QoS) when offering a speech recognition service for use in noisy environments. This means that the recognition performance in the target noise environment must be investigated. One approach is to estimate the recognition performance from a distortion val...
Gespeichert in:
Veröffentlicht in: | IEEE transactions on audio, speech, and language processing speech, and language processing, 2006-11, Vol.14 (6), p.2006-2013 |
---|---|
Hauptverfasser: | , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | It is essential to ensure quality of service (QoS) when offering a speech recognition service for use in noisy environments. This means that the recognition performance in the target noise environment must be investigated. One approach is to estimate the recognition performance from a distortion value, which represents the difference between noisy speech and its original clean version. Previously, estimation methods using the segmental signal-to-noise ratio (SNRseg), the cepstral distance (CD), and the perceptual evaluation of speech quality (PESQ) have been proposed. However, their estimation accuracy has not been verified for the case when a noise reduction algorithm is adopted as a preprocessing stage in speech recognition. We, therefore, evaluated the effectiveness of these distortion measures by experiments using the AURORA-2J connected digit recognition task and four different noise reduction algorithms. The results showed that in each case the distortion measure correlates well with the word accuracy when the estimators used are optimized for each individual noise reduction algorithm. In addition, it was confirmed that when a single estimator, optimized for all the noise reduction algorithms, is used, the PESQ method gives a more accurate estimate than SNRseg and CD. Furthermore, we have proposed the use of artificial voice of several seconds duration instead of a large amount of real speech and confirmed that a relatively accurate estimate can be obtained by using the artificial voice |
---|---|
ISSN: | 1558-7916 2329-9290 1558-7924 2329-9304 |
DOI: | 10.1109/TASL.2006.883254 |