Noise-aware network with shared channel-attention encoder and joint constraint for noisy speech separation



Bibliographic Details
Published in: Digital Signal Processing, 2025-02, Vol. 157, p. 104891, Article 104891
Authors: Sun, Linhui; Zhou, Xiaolong; Gong, Aifei; Ye, Lei; Li, Pingan; Chng, Eng Siong
Format: Article
Language: English
Online access: Full text
Abstract: Recently, significant progress has been made in end-to-end single-channel speech separation in clean environments. For noisy speech separation, existing research mainly uses deep neural networks to process the noise in speech signals implicitly, which does not fully exploit the impact of noise reconstruction errors on network training. We propose a lightweight noise-aware network with a shared channel-attention encoder and a joint constraint, named NSCJnet, which aims to improve speech separation performance in noisy environments. First, to reduce network parameters, the model uses a parameter-sharing channel-attention encoder to convert noisy speech signals into a feature space. In addition, the channel attention layer (CAlayer) in the encoder enhances the network's representational capacity and separation performance in noisy environments by computing different weights for the filters in the convolution. Second, to make the network converge quickly, we regard noise as an estimation target of equal significance to speech, which compels the network to separate residual noise from the estimated speech, effectively suppressing lingering noise within the speech signal. Furthermore, by integrating a multi-resolution frequency constraint into the time-domain loss, we introduce a weighted time-frequency joint loss constraint, enabling the network to acquire information in both domains, which is conducive to separating noisy speech mixtures. During learning, it automatically strengthens features that are important for separation and suppresses unimportant ones. Results on the noisy WHAM! dataset and the noisy Libri2Mix dataset show that our method has lower computational complexity and outperforms several advanced methods on various speech quality and intelligibility metrics.
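
The abstract does not specify the CAlayer's internal form. Below is a minimal PyTorch sketch of a shared channel-attention encoder, assuming the CAlayer resembles a squeeze-and-excitation style gate over the encoder's convolutional filters; all class names and layer sizes are illustrative assumptions, not taken from the paper.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Reweights encoder filters by global channel statistics
        (an SE-style gate; the paper's CAlayer may differ)."""
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, frames)
            w = self.gate(x.mean(dim=-1))   # squeeze over time, one weight per filter
            return x * w.unsqueeze(-1)      # emphasize informative filters

    class SharedCAEncoder(nn.Module):
        """1-D convolutional encoder followed by channel attention; a single
        instance can be shared wherever waveforms are mapped to features,
        which is one way to realize the parameter sharing the abstract describes."""
        def __init__(self, n_filters: int = 256, kernel_size: int = 16, stride: int = 8):
            super().__init__()
            self.conv = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
            self.attn = ChannelAttention(n_filters)

        def forward(self, wav: torch.Tensor) -> torch.Tensor:
            # wav: (batch, 1, samples) -> (batch, n_filters, frames)
            return self.attn(torch.relu(self.conv(wav)))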
ISSN: 1051-2004
DOI: 10.1016/j.dsp.2024.104891
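
A companion sketch of the weighted time-frequency joint loss the abstract describes, assuming SI-SNR as the time-domain term and an L1 multi-resolution STFT magnitude term as the frequency constraint; the weight alpha, FFT sizes, and function names are illustrative assumptions, not the paper's values.

    import torch

    def si_snr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Negative scale-invariant SNR; est, ref: (batch, samples)."""
        ref = ref - ref.mean(dim=-1, keepdim=True)
        est = est - est.mean(dim=-1, keepdim=True)
        # project the estimate onto the reference, leave the residual as error
        proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
        noise = est - proj
        snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
        return -snr.mean()

    def multires_stft_loss(est, ref, fft_sizes=(512, 1024, 2048)) -> torch.Tensor:
        """L1 distance between STFT magnitudes at several resolutions,
        one plausible form of the multi-resolution frequency constraint."""
        loss = 0.0
        for n_fft in fft_sizes:
            win = torch.hann_window(n_fft, device=est.device)
            E = torch.stft(est, n_fft, hop_length=n_fft // 4, window=win, return_complex=True).abs()
            R = torch.stft(ref, n_fft, hop_length=n_fft // 4, window=win, return_complex=True).abs()
            loss = loss + (E - R).abs().mean()
        return loss / len(fft_sizes)

    def joint_loss(est, ref, alpha: float = 0.5) -> torch.Tensor:
        """Weighted sum of the time-domain and frequency-domain terms."""
        return si_snr_loss(est, ref) + alpha * multires_stft_loss(est, ref)

Since the abstract treats noise as an estimation target of equal significance to speech, one consistent training objective would sum joint_loss over both the separated speech estimates and the estimated noise against its reference.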