A joint feature refining and multi-resolution feature learning approach for bandwidth extension of speech coded by EVS codec

Bibliographic Details
Published in:	Applied Acoustics 2024-06, Vol.222, p.110052, Article 110052
Main Authors:	Xu, Chundong; Tan, Guowu; Ying, Dongwen
Format:	Article
Language:	English
Subjects:	Channel reconstruction module; Multi-resolution representation; Spatial reconstruction module; Speech bandwidth extension; Time–frequency loss function
Online Access:	Full text

Description
Summary:
• To mitigate the influence of redundant channel and spatial features on the output speech in convolutional neural networks, we propose a spatial reconstruction (SR) module that separates high-frequency from low-frequency feature information by setting thresholds and applying a mask.
• Additionally, we introduce a channel reconstruction (CR) module that efficiently captures the important channel information according to the relationships between channels.
• We introduce a time–frequency (T-F) loss function, comprising a time-domain loss and a frequency-domain loss based on the Mel-spectrum and a multi-resolution representation, to optimize the network.

In speech communications, the bandwidth of the speech signal is often limited by the specifications of standardized codecs or by insufficient bitrates. We therefore present a bandwidth extension (BWE) framework for speech signals coded by the Enhanced Voice Services (EVS) codec. Previous studies on speech bandwidth extension based on convolutional neural networks (CNNs) suffer from redundant channel and spatial feature information. This work proposes an end-to-end architecture that combines novel channel and spatial reconstruction modules. Specifically, the spatial reconstruction module uses a mask to distinguish high-frequency from low-frequency features. Group convolution is then employed to enhance the high-frequency features and suppress the low-frequency features. The channel reconstruction module is introduced to reduce unnecessary feature information. Additionally, we introduce a novel time–frequency loss function, incorporating a time-domain loss and a frequency-domain loss based on the Mel-spectrum and a multi-resolution representation, to optimize the network. To assess the performance of the model and the loss function, we conducted experiments on sub-datasets encoded at 6.6 kbps, 7.2 kbps, and 8 kbps. The experimental results demonstrate that the proposed model surpasses the baseline models in terms of LSD, SNR, PESQ, and MOS scores.
ISSN:	0003-682X 1872-910X
DOI:	10.1016/j.apacoust.2024.110052
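
Illustrative note: the summary describes a time–frequency loss that adds a time-domain term to a frequency-domain term computed at several resolutions. The record does not give the exact formulation, so the following is only a minimal NumPy sketch under stated assumptions: the L1 distances, the log-magnitude comparison, the three (FFT size, hop) pairs, the weight `alpha`, and the function names `stft_mag` and `tf_loss` are all hypothetical choices, and the Mel-spectrum term mentioned in the summary is omitted for brevity.

```python
import numpy as np


def stft_mag(x, n_fft, hop):
    """Magnitude STFT of a 1-D signal using a Hann window (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft keeps the n_fft // 2 + 1 non-negative frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1))


def tf_loss(pred, target,
            resolutions=((256, 64), (512, 128), (1024, 256)),
            alpha=1.0, eps=1e-7):
    """Time-domain L1 loss plus a multi-resolution log-spectral L1 loss.

    All hyperparameters here are assumptions made for illustration;
    they are not taken from the paper.
    """
    # Time-domain term: mean absolute error between the waveforms.
    time_loss = np.mean(np.abs(pred - target))

    # Frequency-domain term: average log-magnitude distance over resolutions.
    freq_loss = 0.0
    for n_fft, hop in resolutions:
        p = stft_mag(pred, n_fft, hop)
        t = stft_mag(target, n_fft, hop)
        freq_loss += np.mean(np.abs(np.log(p + eps) - np.log(t + eps)))
    freq_loss /= len(resolutions)

    return time_loss + alpha * freq_loss
```

Calling `tf_loss(pred, target)` on two equal-length waveforms that are at least as long as the largest FFT size returns a single scalar; identical inputs give zero. Comparing spectra at several window sizes trades time resolution against frequency resolution, which is the usual motivation for a multi-resolution term.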