A Joint Approach for Single-Channel Speaker Identification and Speech Separation

Bibliographic details
Published in:	IEEE Transactions on Audio, Speech, and Language Processing, 2012-11, Vol. 20 (9), pp. 2586-2601
Main authors:	Mowlaee, P., Saeidi, R., Christensen, M. G., Tan, Z.-H., Kinnunen, T., Fränti, P., Jensen, S. H.
Format:	Article
Language:	English
Keywords:	Accuracy; Adaptation models; Algorithms; Applied sciences; Automatic speech recognition; BSS EVAL; Byproducts; Detection, estimation, filtering, equalization, prediction; Exact sciences and technology; Hidden Markov models; Information, signal and communications theory; Intelligibility; Miscellaneous; Separation; Signal and communications theory; Signal processing; Signal, noise; single-channel speech separation; sinusoidal modeling; speaker identification; Speech; Speech coding; Speech processing; Speech recognition; State of the art; Studies; Telecommunications and information theory; Vectors; Voice recognition

Description
Abstract:	In this paper, we present a novel system for joint speaker identification and speech separation. For speaker identification, a single-channel speaker identification algorithm is proposed which provides an estimate of the signal-to-signal ratio (SSR) as a by-product. For speech separation, we propose a sinusoidal model-based algorithm. The speech separation algorithm consists of a double-talk/single-talk detector followed by a minimum mean square error estimator of sinusoidal parameters for finding optimal codevectors from pre-trained speaker codebooks. In evaluating the proposed system, we start from a situation where we have prior information about codebook indices, speaker identities and SSR level, and then, by relaxing these assumptions one by one, we demonstrate the efficiency of the proposed fully blind system. In contrast to previous studies that mostly focus on automatic speech recognition (ASR) accuracy, here we report objective and subjective results as well. The results show that the proposed system performs as well as the best of the state of the art in terms of perceived quality, while its speaker identification and automatic speech recognition performance is generally lower. It outperforms the state of the art in terms of intelligibility, showing that the ASR results are not conclusive. On average, the proposed method achieves 52.3% ASR accuracy, 41.2 points in MUSHRA and 85.9% speech intelligibility.
ISSN:	1558-7916, 2329-9290, 1558-7924, 2329-9304
DOI:	10.1109/TASL.2012.2208627
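As a rough illustration of the codebook-based estimation step described in the abstract, the Python sketch below computes a posterior-weighted (MMSE) estimate of each speaker's spectrum over all pairs of codevectors drawn from two pre-trained codebooks, together with an SSR in dB. It is not the authors' estimator: the paper works on sinusoidal parameters, whereas this sketch assumes a simplified additive model in which the mixture magnitude spectrum equals the sum of one codevector per speaker plus Gaussian error with variance `noise_var`, with a uniform prior over codevector pairs. The function names `mmse_codevector_pair` and `ssr_db` and the toy codebooks are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mmse_codevector_pair(
    mixture: Sequence[float],
    codebook_a: Sequence[Sequence[float]],
    codebook_b: Sequence[Sequence[float]],
    noise_var: float = 1.0,
) -> Tuple[Vector, Vector, List[List[float]]]:
    """Posterior-weighted (MMSE) estimates of both speakers' spectra.

    Hypothetical model: mixture = codevector_a + codevector_b + Gaussian
    error with variance `noise_var`; every (i, j) pair is equally likely a priori.
    Returns (estimate_a, estimate_b, posterior[i][j]).
    """
    # Log-likelihood of every codevector pair under the additive model.
    loglik = [
        [-sq_dist(mixture, [a + b for a, b in zip(ca, cb)]) / (2.0 * noise_var)
         for cb in codebook_b]
        for ca in codebook_a
    ]
    # Normalize to posteriors; subtracting the peak avoids exp() underflow.
    peak = max(max(row) for row in loglik)
    unnorm = [[math.exp(v - peak) for v in row] for row in loglik]
    total = sum(sum(row) for row in unnorm)
    post = [[v / total for v in row] for row in unnorm]

    # MMSE estimate = posterior-weighted average of the codevectors.
    dim = len(mixture)
    est_a = [0.0] * dim
    est_b = [0.0] * dim
    for i, ca in enumerate(codebook_a):
        for j, cb in enumerate(codebook_b):
            w = post[i][j]
            for k in range(dim):
                est_a[k] += w * ca[k]
                est_b[k] += w * cb[k]
    return est_a, est_b, post


def ssr_db(target: Sequence[float], interferer: Sequence[float]) -> float:
    """Signal-to-signal ratio in dB: target power over interferer power."""
    p_target = sum(x * x for x in target)
    p_interferer = sum(x * x for x in interferer)
    return 10.0 * math.log10(p_target / p_interferer)


if __name__ == "__main__":
    # Toy 3-bin codebooks for two speakers (illustrative values only).
    codebook_a = [[4.0, 0.0, 1.0], [0.0, 3.0, 0.0]]
    codebook_b = [[1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
    mixture = [5.0, 1.0, 1.0]  # exactly codebook_a[0] + codebook_b[0]
    est_a, est_b, post = mmse_codevector_pair(mixture, codebook_a, codebook_b, 0.1)
    print("posteriors:", post)
    print("speaker A:", est_a, "speaker B:", est_b)
    print("estimated SSR (dB):", ssr_db(est_a, est_b))
```

Replacing the posterior-weighted sum with the single highest-posterior pair would give a MAP choice of codevectors instead; the MMSE form tends to degrade more gracefully when several pairs explain the mixture about equally well.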