ASTT: acoustic spatial-temporal transformer for short utterance speaker recognition
Text-independent Short Utterance Speaker Recognition (SUSR) is of importance for the purpose of person authentication. However, it is a great challenge for the speaker recognition with a short utterance, which is defined as the duration of a speech is shorter than 5 seconds. To address this problem,...
Gespeichert in:
Veröffentlicht in: | Multimedia tools and applications 2023-09, Vol.82 (21), p.33039-33061 |
---|---|
Hauptverfasser: | , , , , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Text-independent Short Utterance Speaker Recognition (SUSR) is of importance for the purpose of person authentication. However, it is a great challenge for the speaker recognition with a short utterance, which is defined as the duration of a speech is shorter than 5 seconds. To address this problem, an Acoustic Spatial-Temporal Transformer (ASTT) method is proposed to alleviate the bottleneck of short utterance speaker recognition. The contribution of the proposed ASTT method can be expressed as two parts. On the one hand, the ASTT method has a simple and elegant structure. Without convolutional structures, the ASTT method is purely based on an attention mechanism combining temporal and spatial features of speakers with knowledge migration on the ImageNet. On the other hand, the ASTT method has good performance on text-independent short utterance speaker recognition. Extensive experiments demonstrate that the proposed ASTT method outperforms state-of-the-art methods on audio dataset with no more than 5-second speech clips with equal error rate (EER) of 6.93% and minimum detection cost function (minDCF) of 0.487, which has a relative improvement of 41.8% and 33.7%, respectively. Furthermore, the qualitative and quantitative analysis proves the effectiveness and efficiency of proposed ASTT, which can not only accelerate model converging, but also reduce the size of training data by 90%. |
---|---|
ISSN: | 1380-7501 1573-7721 |
DOI: | 10.1007/s11042-023-14657-x |