Analysis-by-synthesis based training target extraction of the DNN for noise masking

Bibliographic details
Published in: Speech communication 2022-10, Vol.144, p.26-41
Main authors: Cui, Zihao, Bao, Changchun
Format: Article
Language: English
Description
Summary:

•An ideal real-valued ratio mask (IRVRM) extraction method based on analysis-by-synthesis (ABS) is proposed to incorporate spectral dependency. In the synthesis process, the enhanced speech is obtained by the inverse short-time Fourier transform of the masked spectrum, whereas in the analysis process, the IRVRM is obtained by maximizing the speech quality of the speech reconstructed from the mask space.
•The ABS loop method is proposed to reduce the computational complexity of the ABS-based mask design by loop searching in an iteratively generated subspace.
•The generated subspace is linearly spanned by the projection of a specific basis matrix. This basis matrix is the descent direction of the mean square error between the reconstructed speech and the clean speech.

In conventional speech enhancement methods based on deep neural networks (DNN), the noise-masking targets in the time-frequency domain, such as the ideal ratio mask and the phase-sensitive mask, do not consider the dependency of the spectrum. In this paper, an ideal real-valued ratio mask (IRVRM) extraction method based on analysis-by-synthesis (ABS) is proposed to exploit this dependency. In the synthesis process, the enhanced speech is obtained by the inverse short-time Fourier transform (ISTFT) of the masked spectrum, whereas in the analysis process, the IRVRM is determined by maximizing the speech quality of the speech reconstructed from the mask space. The ABS loop algorithm is proposed to reduce the computational complexity: in each loop, the best mask within a specifically generated subspace is determined. After the ABS loop, an approximated IRVRM is obtained. This IRVRM is then used as the training target of the DNN. The experimental results show that when the IRVRM extracted with the ABS loop is employed as the training target, the speech quality of DNN-based noise masking is effectively improved.
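The loop described in the summary can be illustrated with a small, dependency-free sketch. It is not the authors' algorithm. It assumes a plain mean square error objective in place of the speech quality measure, a one-dimensional search subspace spanned by the MSE descent direction, and a toy sine window with 50% overlap. All names (`stft`, `istft`, `synthesize`, `abs_loop`) and all parameter values are hypothetical and chosen only for illustration.

```python
import cmath
import math
import random

N, HOP = 8, 4  # frame length and hop size (50% overlap); illustrative values
# Sine window: the squares of the two overlapping copies sum to 1.
WIN = [math.sin(math.pi * (n + 0.5) / N) for n in range(N)]

def dft(frame, sign):
    """Unitary DFT (sign=-1) or its inverse (sign=+1) of one frame."""
    scale = 1 / math.sqrt(N)
    return [scale * sum(v * cmath.exp(sign * 2j * math.pi * k * n / N)
                        for n, v in enumerate(frame))
            for k in range(N)]

def stft(x):
    """Windowed frames followed by a unitary DFT; returns a list of frames."""
    starts = range(0, len(x) - N + 1, HOP)
    return [dft([WIN[n] * x[t + n] for n in range(N)], -1) for t in starts]

def istft(spec, length):
    """Inverse DFT, windowing and overlap-add (the adjoint of stft)."""
    out = [0.0] * length
    for i, frame in enumerate(spec):
        for n, v in enumerate(dft(frame, +1)):
            out[i * HOP + n] += WIN[n] * v.real
    return out

def synthesize(mask, noisy_spec, length):
    """Synthesis step: ISTFT of the masked noisy spectrum."""
    masked = [[m * y for m, y in zip(mrow, yrow)]
              for mrow, yrow in zip(mask, noisy_spec)]
    return istft(masked, length)

def abs_loop(clean, noisy, n_iter=20):
    """Refine a real-valued mask by repeated synthesis and line search."""
    noisy_spec = stft(noisy)
    mask = [[1.0] * N for _ in noisy_spec]  # start from the all-ones mask
    errors = []  # squared error measured before each update
    for _ in range(n_iter):
        # Synthesis: reconstruct speech from the current mask.
        recon = synthesize(mask, noisy_spec, len(clean))
        resid = [r - c for r, c in zip(recon, clean)]
        errors.append(sum(e * e for e in resid))
        # Analysis: descent direction of the time-domain squared error
        # with respect to the real mask (ASSUMED objective: MSE).
        direction = [[(y.conjugate() * e).real for y, e in zip(yrow, erow)]
                     for yrow, erow in zip(noisy_spec, stft(resid))]
        # Exact line search in the 1-D subspace spanned by that direction.
        moved = synthesize(direction, noisy_spec, len(clean))
        denom = sum(v * v for v in moved)
        if denom == 0.0:
            break
        alpha = sum(r * v for r, v in zip(resid, moved)) / denom
        mask = [[m - alpha * d for m, d in zip(mrow, drow)]
                for mrow, drow in zip(mask, direction)]
    return mask, errors

# Toy data: a sinusoid as "clean speech" plus seeded uniform noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 3 * n / 32) for n in range(64)]
noisy = [c + 0.5 * rng.uniform(-1, 1) for c in clean]
mask, errors = abs_loop(clean, noisy)
print(errors[0], errors[-1])  # error before the first and the last update
```

Because overlapping frames are added together in `istft`, changing one mask entry affects samples shared with neighbouring frames, which is the spectral dependency that a per-bin target such as the ideal ratio mask ignores. In this sketch each loop searches only along one direction, so the squared error cannot increase from one loop to the next; a quality-driven search over a richer subspace, as the summary describes, would replace the line search.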
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2022.08.006