Supervised Attention Multi-Scale Temporal Convolutional Network for monaural speech enhancement

Bibliographic details
Published in:	EURASIP journal on audio, speech, and music processing, 2024-04, Vol.2024 (1), p.20-16, Article 20
Main authors:	Zhang, Zehua; Zhang, Lu; Zhuang, Xuyi; Qian, Yukun; Wang, Mingjiang
Format:	Article
Language:	English
Subjects:	Acoustics; Artificial neural networks; Complex compressed spectrum; Complex ratio mask; Engineering; Engineering Acoustics; Feature extraction; Intelligibility; Mathematics in Music; Methodology; Modules; Monaural speech enhancement; Multi-scale temporal convolutional network; Noise reduction; Signal to noise ratio; Signal, Image and Speech Processing; Speech; Speech processing; Supervised attention
Online access:	Full text

Description
Summary:	Speech signals are often distorted by reverberation and noise, with a widely distributed signal-to-noise ratio (SNR). To address this, our study develops robust deep neural network (DNN)-based speech enhancement methods. We reproduce several DNN-based monaural speech enhancement methods and outline a strategy for constructing datasets. This strategy, validated through experimental reproductions, has effectively enhanced the denoising efficiency and robustness of the models. Then, we propose a causal speech enhancement system named Supervised Attention Multi-Scale Temporal Convolutional Network (SA-MSTCN). SA-MSTCN extracts the complex compressed spectrum (CCS) for input encoding and employs complex ratio masking (CRM) for output decoding. The supervised attention module, a lightweight addition to SA-MSTCN, guides feature extraction. Experimental results show that the supervised attention module effectively improves noise reduction performance with a minor increase in computational cost. The multi-scale temporal convolutional network refines the perceptual field and better reconstructs the speech signal. Overall, SA-MSTCN not only achieves state-of-the-art speech quality and intelligibility compared to other methods but also maintains stable denoising performance across various environments.
ISSN:	1687-4722, 1687-4714
DOI:	10.1186/s13636-024-00341-x
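
The summary mentions a complex compressed spectrum (CCS) for input encoding and complex ratio masking (CRM) for output decoding. The following is a minimal NumPy sketch of those two ideas in their generic textbook form, not the authors' implementation. The record does not define CCS, so power-law magnitude compression with an exponent of 0.5 is an assumption, as are all function names and the toy spectra. The mask here is an oracle mask computed from a known clean signal; in a system like the one described, a network would estimate the mask instead.

```python
import numpy as np

def compress_spectrum(spec, exponent=0.5):
    """Power-law compress the magnitude of a complex spectrum, keeping the phase.
    The exponent is an assumed value, not taken from the record."""
    return (np.abs(spec) ** exponent) * np.exp(1j * np.angle(spec))

def decompress_spectrum(spec, exponent=0.5):
    """Invert compress_spectrum for the same exponent."""
    return (np.abs(spec) ** (1.0 / exponent)) * np.exp(1j * np.angle(spec))

def oracle_complex_ratio_mask(clean, noisy, eps=1e-8):
    """Complex ratio M = S / Y, written as S * conj(Y) / (|Y|^2 + eps) for stability."""
    return clean * np.conj(noisy) / (np.abs(noisy) ** 2 + eps)

def apply_complex_ratio_mask(mask, noisy):
    """Element-wise complex multiplication of the mask with the noisy spectrum."""
    return mask * noisy

# Toy complex spectra with a hypothetical shape of (frequency bins, frames).
clean = np.array([[1.0 + 2.0j, -0.5 + 0.3j],
                  [0.2 - 1.0j, 2.0 + 0.0j]])
noise = np.array([[0.1 - 0.2j, 0.3 + 0.1j],
                  [-0.2 + 0.2j, 0.05 - 0.4j]])
noisy = clean + noise

# Compression followed by decompression recovers the original spectrum.
round_trip = decompress_spectrum(compress_spectrum(noisy))

# With an oracle mask, masking the noisy spectrum recovers the clean one.
mask = oracle_complex_ratio_mask(clean, noisy)
estimate = apply_complex_ratio_mask(mask, noisy)
```

Because the mask is complex-valued, it can correct both the magnitude and the phase of each time-frequency bin, which a real-valued magnitude mask cannot do.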