Speech Enhancement Using a Risk Estimation Approach

Bibliographic details
Published in: Speech Communication, 2020-01, Vol. 116, pp. 12-29
Main authors: Sadasivan, Jishnu; Seelamantula, Chandra Sekhar; Muraka, Nagarjuna Reddy
Format: Article
Language: English
Description
Abstract: In this paper, we develop a risk estimation framework for speech enhancement, where we optimize an unbiased estimate of the risk instead of the actual risk. The estimated risk is expressed solely as a function of the noisy observations and noise statistics. Hence, the denoiser obtained by minimizing the risk estimate does not require the clean speech prior. State-of-the-art image denoising techniques optimize Stein's unbiased risk estimate (SURE), which is an unbiased estimate of the mean-squared error (MSE), to obtain the optimum denoising function. Even though the MSE is a widely and successfully used distortion measure for signal denoising, in speech processing applications, distortion measures such as Itakura-Saito (IS), hyperbolic-cosine (cosh), and weighted cosh are known to be more perceptually relevant than the MSE. Taking this into account, we solve the speech denoising problem within the framework of perceptual risk estimation: we derive unbiased estimates of speech-specific perceptual distortion measures and minimize them to obtain the corresponding denoising functions. We employ a DCT-domain pointwise shrinkage estimator for denoising, where the optimum shrinkage estimator is obtained by minimizing the perceptual risk estimate. We evaluate the performance of the risk estimation-based techniques by objective assessment in terms of segmental signal-to-noise ratio (SSNR), perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and signal-to-distortion ratio (SDR), and by subjective assessment by means of listening tests. Validation on several speech signals in real-world nonstationary noise scenarios and comparisons with benchmark techniques showed that, for input SNR greater than 5 dB, the proposed method results in better denoising performance than several benchmark techniques.
Among the risk estimation-based techniques, the quality of the denoised speech (measured in terms of PESQ and subjective listening scores) is higher for the perceptual risk-based techniques than for the MSE-based technique. Further, we emphasize that the proposed methodology is relatively simple from an implementation perspective, since the shrinkage estimators are easy to compute, and it does not require training, making it ideal for deployment in practical applications, particularly those involving hearing aids, mobile devices, etc. The goal in speech enhancement is to obtain an estimate of clean speech starting from the noisy signal by mi
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2019.11.001
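
Illustrative note: the abstract describes a DCT-domain pointwise shrinkage estimator whose gain is chosen by minimizing an unbiased risk estimate. The following is only a minimal sketch of that idea for the plain MSE case, not the authors' implementation. It assumes zero-mean white noise with a known variance `sigma2`, a purely multiplicative gain `a` applied to each DCT coefficient, and hypothetical function names (`dct_ortho`, `idct_ortho`, `sure_mse`, `sure_gain`, `denoise_frame`). Under those assumptions the risk estimate for one coefficient `y` is (a - 1)^2 y^2 + 2 a sigma2 - sigma2, which depends only on the noisy observation and the noise variance. The perceptual (IS, cosh, weighted cosh) risk estimates from the paper are not reproduced here.

```python
import math


def dct_ortho(x):
    """Orthonormal DCT-II of a short frame (O(N^2), for illustration only)."""
    n = len(x)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        coeffs.append(scale * acc)
    return coeffs


def idct_ortho(c):
    """Inverse of dct_ortho (transpose of the orthonormal DCT-II matrix)."""
    n = len(c)
    frame = []
    for i in range(n):
        acc = c[0] * math.sqrt(1.0 / n)
        for k in range(1, n):
            acc += math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        frame.append(acc)
    return frame


def sure_mse(a, y, sigma2):
    """Unbiased estimate of E[(a*y - s)^2] for y = s + w, Var(w) = sigma2.

    Uses only the noisy coefficient y and the noise variance, never the
    clean coefficient s.
    """
    return (a - 1.0) ** 2 * y * y + 2.0 * sigma2 * a - sigma2


def sure_gain(y, sigma2):
    """Gain in [0, 1] that minimizes sure_mse for a single coefficient."""
    if y == 0.0:
        return 0.0
    # Unconstrained minimizer of the quadratic in a, clipped at zero.
    return max(0.0, 1.0 - sigma2 / (y * y))


def denoise_frame(noisy, sigma2):
    """DCT -> pointwise shrinkage -> inverse DCT.

    An orthonormal transform leaves white-noise variance unchanged, so the
    same sigma2 is used in the DCT domain (an assumption of this sketch).
    """
    coeffs = dct_ortho(noisy)
    shrunk = [sure_gain(y, sigma2) * y for y in coeffs]
    return idct_ortho(shrunk)
```

A per-coefficient gain computed from a single observation is a high-variance choice; it is used here only because it keeps the minimization in closed form. For a frame `noisy` and a noise variance estimated elsewhere, `denoise_frame(noisy, sigma2)` returns the shrunk frame.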