Convolutional gated recurrent unit networks based real-time monaural speech enhancement

Bibliographic details
Published in: Multimedia tools and applications 2023-12, Vol.82 (29), p.45717-45732
Main authors: Vanambathina, Sunny Dayal; Anumola, Vaishnavi; Tejasree, Ponnapalli; Divya, R.; Manaswini, B.
Format: Article
Language: English
Online access: Full text
Description
Abstract: Deep-learning-based speech enhancement has many applications, such as improving speech intelligibility and perceptual quality. Many methods focus on enhancing the amplitude spectrum. In existing models, the computational cost of the complex-valued layers is very high, which poses a major challenge for the device. DFT data is complex-valued, so computation is difficult because the real and imaginary parts of the signal must be handled at the same time. To reduce the computation, some researchers use variants of the STFT as input, such as the amplitude/energy spectrum or the Log-Mel spectrum. These approaches all enhance the amplitude spectrum without estimating the clean phase, which limits enhancement performance. The proposed method uses the DCT, a real-valued transform that loses no information and contains the phase implicitly. This avoids having to manually design a complex network to estimate the explicit phase, and it improves enhancement performance. Much research has been done on estimating the phase spectrum directly and indirectly, but the results are not ideal. Recently, complex-valued models such as the deep complex convolution recurrent network (DCCRN) have been proposed, but their computational cost is very high. A Deep Cosine Transform Convolutional Gated Recurrent Unit (DCTCGRU) network is therefore proposed to reduce complexity and further improve performance. The GRU models the correlation between adjacent frames of noisy speech well. Experimental results show that DCTCGRU achieves better SNR, PESQ and STOI scores than state-of-the-art algorithms.
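
The abstract's claim that the DCT is real-valued and loses no information can be illustrated with a small sketch. This is not the authors' code. It assumes an orthonormal DCT-II with its matching inverse, and the frame length of 16 and the test sinusoid are arbitrary choices made for illustration. The helper names `dft`, `dct2` and `idct2` are hypothetical.

```python
import cmath
import math


def dft(x):
    """Plain DFT: the output is complex, so real and imaginary parts must both be handled."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def _scale(k, n):
    # Orthonormal scaling factors for the DCT-II basis.
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct2(x):
    """Orthonormal DCT-II: maps a real frame to the same number of real coefficients."""
    n = len(x)
    return [
        _scale(k, n)
        * sum(x[t] * math.cos(math.pi * (2 * t + 1) * k / (2 * n)) for t in range(n))
        for k in range(n)
    ]


def idct2(c):
    """Inverse of the orthonormal DCT-II (a DCT-III), used to check losslessness."""
    n = len(c)
    return [
        sum(
            _scale(k, n) * c[k] * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
            for k in range(n)
        )
        for t in range(n)
    ]


# Illustrative frame: a sinusoid with a non-zero phase offset (assumed values).
frame = [math.sin(2 * math.pi * 3 * t / 16 + 0.7) for t in range(16)]

spectrum = dft(frame)      # complex-valued: phase is explicit
coeffs = dct2(frame)       # real-valued: phase is carried implicitly in the coefficients
restored = idct2(coeffs)   # reconstruct the frame from the real coefficients alone

max_error = max(abs(a - b) for a, b in zip(frame, restored))
print("DFT has imaginary parts:", any(abs(z.imag) > 1e-9 for z in spectrum))
print("DCT coefficients are real floats:", all(isinstance(c, float) for c in coeffs))
print("max reconstruction error:", max_error)
```

A network operating on `coeffs` works with plain real numbers, whereas one operating on `spectrum` has to either drop the phase or process real and imaginary parts jointly, which is the cost the abstract attributes to complex-valued models.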
ISSN:1380-7501
1573-7721
DOI:10.1007/s11042-023-15639-9