Text-independent speaker recognition based on adaptive course learning loss and deep residual network

Published in: EURASIP Journal on Advances in Signal Processing, 2021-07, Vol. 2021 (1), p. 1-16, Article 45
Authors: Zhong, Qinghua; Dai, Ruining; Zhang, Han; Zhu, Yongsheng; Zhou, Guofu
Format: Article
Language: English
Abstract: Text-independent speaker recognition is widely used in identity recognition and has a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. In order to improve the recognition ability of log filter bank feature vectors, a method of text-independent speaker recognition based on a deep residual network model was proposed in this paper. The deep residual network was composed of a residual network (ResNet) and a convolutional attention statistics pooling (CASP) layer. The CASP layer aggregates frame-level features from the ResNet into utterance-level features. Extracting speech features for each speaker using deep residual networks is a promising direction to explore, and a straightforward solution is to train the discriminative feature extraction network with a margin-based loss function. However, margin-based loss functions often have certain limitations; for example, the margins between different categories are set to be the same and fixed. Thus, we used an adaptive curriculum learning loss (ACLL) to address this problem, combining it with two different margin-based losses, i.e., AM-Softmax and AAM-Softmax. The proposed method was applied to the large-scale VoxCeleb2 dataset for extensive text-independent speaker recognition experiments, and the average equal error rate (EER) reached 1.76% on the VoxCeleb1 test dataset, 1.91% on the VoxCeleb1-E test dataset, and 3.24% on the VoxCeleb1-H test dataset. Compared with related speaker recognition methods, the EER was improved by 1.11% on the VoxCeleb1 test dataset, 1.04% on the VoxCeleb1-E test dataset, and 1.69% on the VoxCeleb1-H test dataset.
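
The frame-to-utterance aggregation the abstract describes can be sketched as follows. This is a minimal PyTorch illustration of attentive statistics pooling, the general mechanism behind the paper's CASP layer: attention weights are learned per frame, and the attention-weighted mean and standard deviation of the frame-level features are concatenated into a single utterance-level embedding. The class name, bottleneck size, and feature dimensions are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch only: attentive statistics pooling in the spirit of the CASP layer.
# Layer sizes and names are assumed for illustration.
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, feat_dim: int, bottleneck: int = 128):
        super().__init__()
        # Small attention network that scores each frame of the input.
        self.attention = nn.Sequential(
            nn.Conv1d(feat_dim, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, feat_dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, num_frames) frame-level features from the ResNet.
        w = torch.softmax(self.attention(x), dim=2)    # per-frame attention weights
        mu = torch.sum(w * x, dim=2)                   # attention-weighted mean
        var = torch.sum(w * x ** 2, dim=2) - mu ** 2   # attention-weighted variance
        sigma = torch.sqrt(var.clamp(min=1e-8))        # attention-weighted std
        return torch.cat([mu, sigma], dim=1)           # (batch, 2 * feat_dim)

# Usage: pool 300 frames of 256-dim features into one 512-dim utterance vector.
pool = AttentiveStatsPooling(feat_dim=256)
frames = torch.randn(4, 256, 300)
embedding = pool(frames)  # shape: (4, 512)
```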
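
For reference, below is a minimal sketch of the AAM-Softmax (additive angular margin) loss that the abstract names as one of the two margin-based losses. The ACLL proposed in the paper adapts such a margin during training rather than keeping it fixed; this sketch shows only the baseline fixed-margin formulation. The margin, scale, embedding dimension, and speaker count are assumed values for illustration.

```python
# Sketch only: fixed-margin AAM-Softmax; the paper's ACLL adapts the margin.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, embed_dim: int, num_speakers: int,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        # One weight vector per speaker class.
        self.weight = nn.Parameter(torch.empty(num_speakers, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle.
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)

# Usage: 192-dim utterance embeddings classified over 5994 speakers
# (the usual VoxCeleb2 development-set speaker count).
criterion = AAMSoftmax(embed_dim=192, num_speakers=5994)
emb = torch.randn(8, 192)
lbl = torch.randint(0, 5994, (8,))
loss = criterion(emb, lbl)
```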
ISSN: 1687-6172, 1687-6180
DOI: 10.1186/s13634-021-00762-2