Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network


Bibliographic details
Published in:	PLoS ONE, 2023-05, Vol. 18 (5), p. e0285629
Main authors:	Peracha, Fahad Khalil; Khattak, Muhammad Irfan; Salem, Nema; Saleem, Nasir
Format:	Article
Language:	English
Subjects:	Algorithms; Analysis; Attention; Automatic speech recognition; Background noise; Biology and Life Sciences; Cell phones; Cellular telephones; Coders; Computer and Information Sciences; Deep learning; Encoders-Decoders; Engineering and Technology; Error analysis; Evaluation; Intelligibility; Latency; Long short-term memory; Memory, Long-Term; Mobile phones; Modelling; Neural networks; Neural Networks, Computer; Noise; Noise reduction; Optimization; Physical Sciences; Real time; Recurrent neural networks; Short term memory; Social Sciences; Speech; Speech enhancement; Speech Intelligibility; Speech Perception; Speech processing; Speech production; Speech recognition; Technology application; Time; Time-frequency analysis; Voice communication; Voice quality; Voice recognition; Weight loss
Online access:	Full text

Description
Abstract:	Speech enhancement (SE) reduces background noise in target speech and is applied at the front end of various real-world applications, including robust automatic speech recognition (ASR) and real-time processing in mobile phone communications. SE systems are commonly integrated into mobile phones to increase quality and intelligibility. As a result, a low-latency system is required to operate in real-world applications. At the same time, these systems need efficient optimization. This research focuses on single-microphone SE operating in real-time systems with better optimization. We propose a causal data-driven model that uses an attention encoder-decoder long short-term memory (LSTM) network to estimate a time-frequency mask from noisy speech in order to produce clean speech for real-time applications that need low-latency causal processing. The proposed model combines the encoder-decoder LSTM with a causal attention mechanism. Furthermore, a dynamical-weighted (DW) loss function is proposed to improve model learning by varying the loss weight values. Experiments demonstrated that the proposed model consistently improves voice quality, intelligibility, and noise suppression. In the causal processing mode, the LSTM-based estimated suppression time-frequency mask outperforms the baseline model for unseen noise types. The proposed SE improved the STOI by 2.64% (baseline LSTM-IRM), 6.6% (LSTM-KF), 4.18% (DeepXi-KF), and 3.58% (DeepResGRU-KF). In addition, we examine word error rates (WERs) using Google's Automatic Speech Recognition (ASR). The ASR results show that error rates decreased from 46.33% (noisy signals) to 13.11% (proposed), 15.73% (LSTM), and 14.97% (LSTM-KF).
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0285629
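
Note on the WER figures: the abstract reports word error rates from Google's ASR but does not say how they were scored. As a minimal sketch, assuming the standard definition (word-level edit distance divided by the number of reference words), the metric can be computed as below. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not taken from the article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, a WER above 1.0 is possible when the hypothesis is much longer than the reference. Published percentages such as those in the abstract would normally be aggregated over a whole test set, not averaged per sentence.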