Improving generative adversarial networks for speech enhancement through regularization of latent representations


Bibliographic Details
Published in: Speech Communication 2020-04, Vol. 118, p. 1-9
Main authors: Yang, Fan; Wang, Ziteng; Li, Junfeng; Xia, Risheng; Yan, Yonghong
Format: Article
Language: English
Online Access: Full text
Description
Abstract:
•We propose a new network structure and a new loss function, which benefit our model in speech enhancement under low signal-to-noise ratio (SNR) environments and low-resource environments.
•Unlike most network structures, the new network enables us to make full use of the information carried by the clean speech signals.
•The new loss allows us to obtain a more accurate speech feature representation from a noisy speech signal and improves the optimization direction of the network.
•We explain the reasons for the excellent performance of the proposed model.
•Extensive experiments demonstrate the generality of our model in a variety of speech enhancement cases.

Speech enhancement aims to improve the quality and intelligibility of speech signals, which is a challenging task in adverse environments. The speech enhancement generative adversarial network (SEGAN), which adopted a generative adversarial network (GAN) for speech enhancement, achieved promising results. In this paper, a new network architecture and loss function based on SEGAN are proposed for speech enhancement. Different from most network structures applied in this field, the new network, called high-level GAN (HLGAN), uses parallel noisy and clean speech signals as input in the training phase instead of only noisy speech signals, which enables us to make full use of the information carried by the clean speech signals. Additionally, we introduce a new supervised speech representation loss, also known as high-level loss, in the middle hidden layer of the generative network. The high-level loss function is advantageous to HLGAN in speech enhancement under low signal-to-noise ratio (SNR) environments and low-resource environments. We evaluate the performance of HLGAN over a wide range of experiments, in which our model produces significant improvements. Extensive experiments further demonstrate the generality of our model in a variety of speech enhancement cases.
The issue of SEGAN losing speech components while removing noise in low-SNR environments is mitigated. In addition, HLGAN can effectively enhance the speech signals of two low-resource languages simultaneously. The reasons for HLGAN's superior performance are also discussed.
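The core idea of the supervised "high-level" loss described in the abstract can be sketched as follows: feed parallel noisy and clean signals through the generator's encoder and penalize the distance between their mid-layer representations. This is a minimal numpy illustration only; the toy one-layer `encoder_features` and the choice of an L1 distance are assumptions for illustration, not the paper's actual architecture or exact loss.

```python
import numpy as np

def encoder_features(x, weights):
    """Toy stand-in for the generator's encoder up to its middle
    hidden layer: one affine map plus ReLU. (Hypothetical; the
    actual HLGAN encoder is a deep convolutional stack.)"""
    return np.maximum(0.0, x @ weights)

def high_level_loss(noisy, clean, weights):
    """Supervised representation loss: distance between mid-layer
    features of the noisy input and of the parallel clean input.
    (L1 distance is an assumption made for this sketch.)"""
    h_noisy = encoder_features(noisy, weights)
    h_clean = encoder_features(clean, weights)
    return float(np.mean(np.abs(h_noisy - h_clean)))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))                      # toy encoder weights
clean = rng.standard_normal((4, 16))                  # batch of clean frames
noisy = clean + 0.1 * rng.standard_normal((4, 16))    # parallel noisy frames
print(high_level_loss(noisy, clean, W))
```

Minimizing this term pushes the encoder to map a noisy signal to the same internal representation as its clean counterpart, which is how parallel clean data can guide training beyond the adversarial objective alone.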
ISSN: 0167-6393; 1872-7182
DOI: 10.1016/j.specom.2020.02.001