Multi-target ensemble learning based speech enhancement with temporal-spectral structured target

Bibliographic details
Published in:	Applied Acoustics, 2023-03, Vol. 205, p. 109268, Article 109268
Main authors:	Wang, Wenbo; Guo, Weiwei; Liu, Houguang; Yang, Jianhua; Liu, Songyong
Format:	Article
Language:	English
Subjects:	Multi-target ensemble learning; Sparse nonnegative matrix factorization; Speech enhancement; Temporal-spectral structured target
Online access:	Full text

Description
Summary:
• A novel structured multi-objective integrated learning framework is proposed to improve speech enhancement performance.
• The structured targets provide more information to the network and reduce information loss.
• Dimension has a greater influence on the structured IRM than on the speech spectrum.
• The proposed framework achieves good speech enhancement results, especially in nonstationary noise environments.

Recently, deep neural network (DNN)-based speech enhancement has shown considerable success, with mapping-based and masking-based methods being the two most commonly used approaches. However, these methods do not consider the spectral structure of the signal. In this paper, a novel structured multi-target ensemble learning (SMTEL) framework is proposed, which uses the temporal-spectral structures of the targets to improve speech quality and intelligibility. First, the basis matrices of clean speech, noise, and the ideal ratio mask (IRM) are extracted by sparse nonnegative matrix factorization; these matrices contain the basic structures of the signal. Second, the basis matrices are co-trained with a multi-target DNN, which estimates the activation matrices instead of estimating the targets directly. Then, a jointly trained single-layer perceptron is proposed to integrate the two targets and further improve speech quality and intelligibility. The sequential floating forward selection method is used to systematically analyze the impact of the integrated targets on enhancement performance and the effect of the target weights on the results. Finally, the proposed method is combined with progressive learning to further improve enhancement performance. Systematic experiments on the UW/NU corpus show that the proposed method achieves the best enhancement results at low network cost and complexity, especially in visible nonstationary noise environments. Compared with the target integration method that does not use structured targets and with the long short-term memory masking method, the proposed method improves speech quality in restaurant noise by 25.6 % and 29.2 %, and speech intelligibility by 35.5 % and 15.8 %, respectively.
ISSN:	0003-682X 1872-910X
DOI:	10.1016/j.apacoust.2023.109268
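
The summary describes representing a target such as the IRM as a basis matrix times an activation matrix, with the network estimating the activations. The following is only a minimal sketch of that idea, not the authors' implementation. Assumptions: the IRM formula is one common definition (the record does not state which one the paper uses), the factorization uses generic multiplicative updates for a Euclidean cost with an L1 penalty, and the names `ideal_ratio_mask`, `sparse_nmf`, `rank`, and `sparsity` are hypothetical. Random magnitudes stand in for real spectrograms.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """One common IRM definition (assumed here): sqrt(S^2 / (S^2 + N^2))."""
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))

def sparse_nmf(V, rank, sparsity=0.1, n_iter=200, seed=0, eps=1e-8):
    """Factor a nonnegative matrix V (freq x frames) as W @ H.

    W holds the basis vectors (spectral structures) and H the activations.
    Generic multiplicative updates for the Euclidean cost with an L1
    penalty on H; illustrative, not the paper's exact algorithm.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # The sparsity term in the denominator shrinks the activations.
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Rescale each basis column to unit L2 norm and compensate in H,
        # so the product W @ H is unchanged by the rescaling.
        norms = np.linalg.norm(W, axis=0, keepdims=True) + eps
        W /= norms
        H *= norms.T
    return W, H

# Synthetic stand-ins for magnitude spectrograms (64 bins x 100 frames).
rng = np.random.default_rng(1)
speech = rng.random((64, 100))
noise = rng.random((64, 100))

irm = ideal_ratio_mask(speech, noise)
W_irm, H_irm = sparse_nmf(irm, rank=16)

# In the framework described above, a DNN would predict the activations;
# here H_irm from the factorization itself stands in for that prediction.
irm_structured = np.clip(W_irm @ H_irm, 0.0, 1.0)
```

With a fixed basis `W_irm`, the quantity left to estimate per frame is a length-`rank` activation vector rather than a full mask, which is the sense in which the target carries structure in this sketch.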