Speaker Identification in Multi-Talker Overlapping Speech Using Neural Networks

Bibliographic Details
Published in: IEEE Access, 2020, Vol. 8, pp. 134868-134879
Main Authors: Tran, Van-Thuan; Tsai, Wei-Ho
Format: Article
Language: English
Online Access: Full text
Description
Summary: Although numerous works have studied automatic speaker identification (SID), only a few address SID for overlapping speech, and none of them considers more than two simultaneous speakers. Recognizing that overlapping speech occurs frequently in real-life scenarios, such as meetings or debates, this work investigates methods for overlapping SID (OSID) that can determine the identities in overlapping speech from up to five simultaneous speakers. We propose two deep-learning OSID systems: one two-stage and one single-stage. The two-stage system first determines the number of simultaneous speakers and then identifies the speaker(s). The single-stage system uses a single classifier to perform OSID directly and is slightly more computationally efficient than the two-stage system. Our experiments show that the two-stage OSID system achieves better identification accuracy than the single-stage system. In addition, both OSID systems based on one-dimensional convolutional neural networks (1DCNN) perform better than the systems based on multilayer perceptrons (MLP) and Gaussian mixture models (GMMs). The proposed 1DCNN-based two-stage OSID system achieves 98.55% OSID accuracy on clean audio data containing up to five simultaneous speakers. Under more challenging experimental conditions involving both background noise and high overlapping energy ratios, the system still attains accuracies above 90%.
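The summary does not say how the two stages are combined, so the following is only an illustrative sketch of one plausible decision rule, not the authors' implementation. It assumes a hypothetical stage-1 output `count_probs` (one probability per candidate speaker count) and a hypothetical stage-2 output `speaker_scores` (one score per enrolled speaker); the function name and the top-k selection rule are likewise assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def two_stage_osid(count_probs: Sequence[float],
                   speaker_scores: Dict[str, float]) -> List[str]:
    """Hypothetical two-stage decision rule (illustration only).

    count_probs[i] is the assumed stage-1 probability that i + 1
    speakers are talking at once; speaker_scores maps each enrolled
    speaker to an assumed stage-2 score.
    """
    # Stage 1: pick the most likely number of simultaneous speakers.
    n_speakers = max(range(len(count_probs)),
                     key=count_probs.__getitem__) + 1
    # Stage 2: keep the n_speakers highest-scoring identities.
    ranked = sorted(speaker_scores, key=speaker_scores.get, reverse=True)
    return sorted(ranked[:n_speakers])


# Made-up numbers: stage 1 puts most of its mass on three speakers.
count_probs = [0.05, 0.10, 0.70, 0.10, 0.05]
speaker_scores = {"spk_a": 0.91, "spk_b": 0.12, "spk_c": 0.77,
                  "spk_d": 0.64, "spk_e": 0.08}
result = two_stage_osid(count_probs, speaker_scores)
print(result)
```

A single-stage system, by contrast, would map the audio features to a speaker set with one classifier and skip the explicit count estimate.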
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.3009987