An Unsupervised Approach to Cochannel Speech Separation

Bibliographic Details
Published in: IEEE Transactions on Audio, Speech, and Language Processing, 2013-01, Vol. 21 (1), p. 122-131
Main Authors: Hu, Ke; Wang, DeLiang
Format: Article
Language: English
Description
Abstract: Cochannel (two-talker) speech separation is predominantly addressed using pretrained speaker-dependent models. In this paper, we propose an unsupervised approach to separating cochannel speech. Our approach follows the two main stages of computational auditory scene analysis: segmentation and grouping. For voiced speech segregation, the proposed system utilizes a tandem algorithm for simultaneous grouping and then unsupervised clustering for sequential grouping. The clustering is performed by a search that maximizes the ratio of between- to within-group speaker distances while penalizing within-group concurrent pitches. To segregate unvoiced speech, we first produce unvoiced speech segments based on onset/offset analysis. The segments are then grouped using the complementary binary masks of the segregated voiced speech. Despite its simplicity, our approach produces significant SNR improvements across a range of input SNRs. The proposed system yields performance competitive with other speaker-independent and model-based methods.
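The sequential-grouping criterion in the abstract lends itself to a compact illustration. The sketch below is a hypothetical Python rendering of that search, assuming Euclidean distances between per-stream speaker feature vectors and an exhaustive search over binary group assignments; the names (grouping_score, best_assignment), the distance measure, and the penalty_weight parameter are illustrative assumptions, not the authors' implementation.

```python
import itertools
import numpy as np

def grouping_score(features, pitches, assignment, penalty_weight=1.0):
    """Score one binary assignment of simultaneous streams to two talkers.

    features   : (N, D) array, one speaker-characteristic vector per stream
                 (hypothetical; the paper derives its own speaker distances).
    pitches    : list of N sets of frame indices where each stream is pitched.
    assignment : length-N tuple of 0/1 group labels.
    """
    labels = np.asarray(assignment)
    g0, g1 = features[labels == 0], features[labels == 1]
    if len(g0) == 0 or len(g1) == 0:
        return -np.inf  # each talker must receive at least one stream

    # Average pairwise distance inside a group (0 if the group is a singleton).
    def mean_pairwise(group):
        pairs = list(itertools.combinations(group, 2))
        if not pairs:
            return 0.0
        return float(np.mean([np.linalg.norm(a - b) for a, b in pairs]))

    within = mean_pairwise(g0) + mean_pairwise(g1)
    between = float(np.mean([np.linalg.norm(a - b) for a in g0 for b in g1]))

    # Penalize frames where two streams in the same group are both pitched:
    # a single talker cannot produce two concurrent pitches.
    overlap = sum(len(pitches[i] & pitches[j])
                  for i, j in itertools.combinations(range(len(labels)), 2)
                  if labels[i] == labels[j])

    ratio = between / within if within > 0 else between
    return ratio - penalty_weight * overlap

def best_assignment(features, pitches):
    """Exhaustively search all 2^N assignments (fine for small N)."""
    features = np.asarray(features, dtype=float)
    n = len(features)
    return max(itertools.product((0, 1), repeat=n),
               key=lambda a: grouping_score(features, pitches, a))
```

Exhaustive search is tractable only for a small number of simultaneous streams; a full system would need a comparable but more scalable search over assignments for longer utterances.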
ISSN: 1558-7916, 2329-9290, 1558-7924, 2329-9304
DOI: 10.1109/TASL.2012.2215591