ASTT: acoustic spatial-temporal transformer for short utterance speaker recognition

Bibliographic details
Published in:	Multimedia Tools and Applications, 2023-09, Vol. 82 (21), p. 33039-33061
Main authors:	Wu, Xing; Li, Ruixuan; Deng, Bin; Zhao, Ming; Du, Xingyue; Wang, Jianjia; Ding, Kai
Format:	Article
Language:	English
Subjects:	Acoustics; Artificial intelligence; Computer Communication Networks; Computer Science; Cost function; Data Structures and Information Theory; Efficiency; Methods; Multimedia; Multimedia Information Systems; Performance evaluation; Qualitative analysis; Smart houses; Special Purpose and Application-Based Systems; Speech; Speech recognition; Transformers

Description
Abstract:	Text-independent Short Utterance Speaker Recognition (SUSR) is important for person authentication. However, recognizing a speaker from a short utterance, defined here as speech lasting less than 5 seconds, remains a great challenge. To address this problem, an Acoustic Spatial-Temporal Transformer (ASTT) method is proposed to alleviate the bottleneck of short utterance speaker recognition. The contribution of the proposed ASTT method has two parts. On the one hand, the ASTT method has a simple and elegant structure: without convolutional structures, it is based purely on an attention mechanism that combines the temporal and spatial features of speakers, with knowledge migration from ImageNet. On the other hand, the ASTT method performs well on text-independent short utterance speaker recognition. Extensive experiments demonstrate that the proposed ASTT method outperforms state-of-the-art methods on an audio dataset of speech clips no longer than 5 seconds, with an equal error rate (EER) of 6.93% and a minimum detection cost function (minDCF) of 0.487, which are relative improvements of 41.8% and 33.7%, respectively. Furthermore, qualitative and quantitative analysis demonstrates the effectiveness and efficiency of the proposed ASTT, which not only accelerates model convergence but also reduces the size of the training data by 90%.
ISSN:	1380-7501 1573-7721
DOI:	10.1007/s11042-023-14657-x
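
Note on the metrics:	The abstract reports an equal error rate (EER), a minimum detection cost function (minDCF), and relative improvements. The Python sketch below shows one common way such figures are computed from verification trial scores. It is illustrative only and is not taken from the article: the function names are hypothetical, and the cost parameters (p_target=0.01, c_miss=c_fa=1) are assumed defaults that the record does not state.

```python
import numpy as np


def error_rates(target_scores, nontarget_scores):
    """Return (thresholds, miss rate, false-alarm rate) for every candidate threshold."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    all_scores = np.concatenate([target, nontarget])
    # Candidate thresholds: each observed score, plus one above the maximum
    # so that "reject every trial" is also considered.
    thresholds = np.append(np.unique(all_scores), all_scores.max() + 1.0)
    # A trial is accepted when its score is >= the threshold.
    p_miss = np.array([(target < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget >= t).mean() for t in thresholds])
    return thresholds, p_miss, p_fa


def equal_error_rate(target_scores, nontarget_scores):
    """EER: the error rate at the threshold where miss and false-alarm rates meet."""
    _, p_miss, p_fa = error_rates(target_scores, nontarget_scores)
    i = np.argmin(np.abs(p_miss - p_fa))
    return (p_miss[i] + p_fa[i]) / 2.0


def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum detection cost (parameter defaults are assumptions)."""
    _, p_miss, p_fa = error_rates(target_scores, nontarget_scores)
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    # Normalize by the cost of the better trivial system (accept all or reject all).
    return dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))


def relative_improvement(baseline, new):
    """Relative reduction of an error metric with respect to a baseline."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    target = [0.6, 0.4, 0.8, 0.9]
    nontarget = [0.5, 0.1, 0.2, 0.3]
    print("EER:", equal_error_rate(target, nontarget))
    print("minDCF:", min_dcf(target, nontarget))
```

Under this definition a relative improvement is measured against the baseline's error, so a value such as 41.8% means the error metric dropped by that fraction of the baseline value, not by that many percentage points.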