DS-TDNN: Dual-Stream Time-Delay Neural Network With Global-Aware Filter for Speaker Verification



Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, Vol. 32, pp. 2814-2827
Main Authors: Li, Yangfu, Gan, Jiapan, Lin, Xiaodan, Qiu, Yingqiang, Zhan, Hongjian, Tian, Hui
Format: Article
Language: English
Online Access: Order full text
Description
Abstract: Conventional time-delay neural networks (TDNNs) struggle to handle long-range context, so their ability to represent speaker information is limited for long utterances. Existing solutions either depend on increasing model complexity or try to strike a balance between local features and global context to address this issue. To effectively leverage the long-term dependencies of audio signals while constraining model complexity, we introduce a novel module called the Global-aware Filter layer (GF layer), which employs a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse transform to capture global context. Additionally, we develop a dynamic filtering strategy and a sparse regularization method to enhance the performance of the GF layer and prevent overfitting. Based on the GF layer, we present a dual-stream TDNN architecture called DS-TDNN for automatic speaker verification (ASV), which uses two distinct branches to extract local and global features in parallel and employs an efficient strategy to fuse information at different scales. Experiments on the VoxCeleb and SITW databases demonstrate that the DS-TDNN achieves a relative improvement of 10% together with a relative decline of 20% in computational cost over the ECAPA-TDNN on the speaker verification task. This improvement becomes more evident as utterance duration grows. Furthermore, the DS-TDNN also beats popular deep residual models and attention-based systems on utterances of arbitrary length.
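The abstract's core mechanism, applying learnable transform-domain filters between a 1D DFT and its inverse to capture global context, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `global_filter_layer`, the array shapes, and the use of NumPy instead of a deep-learning framework are all assumptions for clarity.

```python
import numpy as np

def global_filter_layer(x, w_real, w_imag):
    """Hypothetical sketch of a Global-aware Filter (GF) layer.

    x:      (channels, time) real-valued feature map
    w_real, w_imag: (channels, time//2 + 1) learnable filter weights,
                    one complex coefficient per frequency bin.
    """
    X = np.fft.rfft(x, axis=-1)            # 1D DFT along the time axis
    W = w_real + 1j * w_imag               # learnable transform-domain filter
    Y = X * W                              # element-wise filtering: global receptive field
    return np.fft.irfft(Y, n=x.shape[-1], axis=-1)  # inverse DFT back to time domain

# Toy usage: an all-ones (identity) filter leaves the input unchanged.
x = np.random.randn(4, 16)
y = global_filter_layer(x, np.ones((4, 9)), np.zeros((4, 9)))
```

Because the element-wise product in the frequency domain corresponds to a circular convolution over the full sequence, each output frame depends on every input frame, which is how such a layer obtains global context at roughly O(T log T) cost.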
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/TASLP.2024.3402072