A Two-Stage Phase-Aware Approach for Monaural Multi-Talker Speech Separation

Bibliographic details
Published in:	IEICE Transactions on Information and Systems 2020/07/01, Vol.E103.D(7), pp.1732-1743
Main authors:	YIN, Lu; LI, Junfeng; YAN, Yonghong; AKAGI, Masato
Format:	Article
Language:	English
Subjects:	Algorithms; amplitude estimation; Artificial neural networks; Automatic speech recognition; Deep learning; Distortion; Distortion of speech signal; Error compensation; Hearing disorders; mask estimation; Neural networks; Phase distortion; phase recovery; Recovery; Separation; Signal distortion; Speech; Speech recognition; speech separation; Voice recognition
Online access:	Full text

Description
Abstract:	Simultaneous utterances degrade speech perception for hearing-impaired listeners and the performance of automatic speech recognition systems. Recently, deep neural networks have dramatically improved speech separation performance. However, most previous works estimate only the speech magnitude and reuse the mixture phase for speech reconstruction, which has become a critical limitation on separation performance. This study proposes a two-stage phase-aware approach for multi-talker speech separation that jointly recovers the magnitude and the phase. For phase recovery, the Multiple Input Spectrogram Inversion (MISI) algorithm is used because of its effectiveness and simplicity. The study implements the MISI algorithm on the basis of masks and shows that the ideal amplitude mask (IAM) is the optimal mask for mask-based MISI phase recovery, yielding less phase distortion. To compensate for phase recovery errors and minimize signal distortion, an advanced mask is proposed for magnitude estimation. The IAM and the proposed mask are estimated at different stages to recover the phase and the magnitude, respectively. Two neural network frameworks are evaluated for magnitude estimation in the second stage, demonstrating the effectiveness and flexibility of the proposed approach. The experimental results show that the proposed approach significantly reduces the distortion of the separated speech.
ISSN:	0916-8532 1745-1361
DOI:	10.1587/transinf.2019EDP7259