An empirical study on analysis window functions for text-independent speaker recognition
This paper describes the effect of analysis window functions on the performance of Mel Frequency Cepstral Coefficient (MFCC) based speaker recognition (SR). The MFCCs of speech signal are extracted from the fixed length frames using Short Time Fourier Analysis (STFA) technique where an appropriate a...
Gespeichert in:
Veröffentlicht in: | International journal of speech technology 2023-03, Vol.26 (1), p.211-220 |
---|---|
Hauptverfasser: | , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | This paper describes the effect of
analysis window functions
on the performance of Mel Frequency Cepstral Coefficient (MFCC) based speaker recognition (SR). The MFCCs of speech signal are extracted from the fixed length frames using Short Time Fourier Analysis (STFA) technique where an appropriate analysis window function is required to extract frames from the complete speech signal of a speaker prior to STFA. The number of frames are consider as the number of MFCC feature vectors of a speaker which uniquely represents the speaker in feature space (domain). For the recognition purpose
Vector Quantization
(VQ) and/or
Gaussian Mixture Model
(GMM) and/or
Universal Background Model GMM
(UBM-GMM) based classifiers are used and a comparative study is made. Generally in state-of-the-art MFCC feature vector extraction, Hamming (in some places abbreviated as Ham in this paper) window function is used, but here we also examine the effect of other window functions like rectangular window, Hann window, B-spline windows, polynomial windows, adjustable windows, hybrid windows and Lanczos window in SR. In the present paper, we briefly describe the analysis window functions and try to evaluate text-independent speaker identification (SI). We also use voice activity detector (VAD) to discard the silence frames before STFA. Indeed, silence frames removal leads to the better performance of SR because MFCC of silent frames make the MFCC feature space intrinsic (MFCC with impurity). Here IITG MV SR database contains speech signal of speakers recorded by different devices, namely, D01, H01, T01, M01 and M02, in different environment, different language, different session. This is the reason for calling the database multi variability. It is observed that VQ classifier performs better than other GMM based classifiers for this database and the classifiers VQ-GMM, VQ-UGM-GMM and the combination of them suffers from singularity problem of covariance matrix. So we evaluate the performance of device D01 for all the classifiers and the three classifiers namely, GMM, UBM-GMM and VQ are used for the remaining four recording devices, H01, T01, M01, M02 because except these three classifiers, all other classifiers suffer from singularity problem of covariance matrix in SI. It is observed that VQ provide the highest accuracy for all the devices. |
---|---|
ISSN: | 1381-2416 1572-8110 |
DOI: | 10.1007/s10772-023-10024-1 |