An empirical study on analysis window functions for text-independent speaker recognition


Bibliographic Details
Published in: International journal of speech technology 2023-03, Vol.26 (1), p.211-220
Main authors: Barai, Bidhan, Das, Nibaran, Basu, Subhadip, Nasipuri, Mita
Format: Article
Language: English
Subjects:
Online access: Full text
description This paper describes the effect of analysis window functions on the performance of Mel Frequency Cepstral Coefficient (MFCC) based speaker recognition (SR). The MFCCs of a speech signal are extracted from fixed-length frames using the Short-Time Fourier Analysis (STFA) technique, where an appropriate analysis window function is required to extract frames from the complete speech signal of a speaker prior to STFA. The number of frames is taken as the number of MFCC feature vectors of a speaker, which uniquely represent the speaker in the feature space (domain). For recognition, Vector Quantization (VQ), Gaussian Mixture Model (GMM), and Universal Background Model GMM (UBM-GMM) based classifiers are used, and a comparative study is made. In state-of-the-art MFCC feature vector extraction the Hamming window function (abbreviated as Ham in places in this paper) is generally used, but here we also examine the effect of other window functions in SR, such as the rectangular window, Hann window, B-spline windows, polynomial windows, adjustable windows, hybrid windows, and the Lanczos window. In the present paper, we briefly describe the analysis window functions and evaluate text-independent speaker identification (SI). We also use a voice activity detector (VAD) to discard silence frames before STFA. Indeed, removing silence frames improves SR performance, because the MFCCs of silent frames contaminate the MFCC feature space (MFCC with impurity). The IITG Multi-Variability (MV) SR database used here contains speech signals of speakers recorded by different devices, namely D01, H01, T01, M01, and M02, in different environments, languages, and sessions; this is why the database is called multi-variability. It is observed that the VQ classifier performs better than the GMM-based classifiers on this database, and that the classifiers VQ-GMM, VQ-UBM-GMM, and their combination suffer from the singularity problem of the covariance matrix.
We therefore evaluate the performance on device D01 for all the classifiers, while only three classifiers, namely GMM, UBM-GMM, and VQ, are used for the remaining four recording devices (H01, T01, M01, M02), because all other classifiers suffer from the covariance-matrix singularity problem in SI. It is observed that VQ provides the highest accuracy for all the devices.
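The frame extraction and windowing step described in the abstract can be sketched in NumPy as follows. This is a minimal illustration, not the paper's exact configuration: the 25 ms frame length, 10 ms hop, and the small window set shown here are illustrative assumptions.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160, window="hamming"):
    """Split a 1-D speech signal into fixed-length frames and apply an
    analysis window to each frame before short-time Fourier analysis."""
    windows = {
        "rectangular": np.ones(frame_len),
        "hamming": np.hamming(frame_len),
        "hann": np.hanning(frame_len),
    }
    w = windows[window]
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return frames * w  # each row is one windowed frame

# toy usage: 1 s of noise at 16 kHz -> 400-sample (25 ms) frames, 160-sample (10 ms) hop
x = np.random.randn(16000)
frames = frame_signal(x, window="hann")
print(frames.shape)  # (98, 400)
```

Each row of the result would then be passed through the FFT and mel filterbank to produce one MFCC vector, so the number of frames equals the number of feature vectors per speaker.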
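The VAD-based silence-frame removal can be illustrated with a minimal energy threshold. This is an assumed heuristic stand-in; the abstract does not specify the internals of the VAD actually used.

```python
import numpy as np

def drop_silence(frames, ratio=0.01):
    """Discard frames whose short-time energy falls below a fraction of the
    peak frame energy -- a simple stand-in for a voice activity detector."""
    energy = (frames ** 2).sum(axis=1)            # per-frame energy
    return frames[energy > ratio * energy.max()]

# toy usage: mix loud "speech" frames with near-silent ones
speech = np.random.randn(10, 400)
silence = 1e-4 * np.random.randn(5, 400)
kept = drop_silence(np.vstack([speech, silence]))
print(kept.shape)  # (10, 400)
```

Only the surviving frames are sent to STFA, which is the step the abstract credits for keeping silent-frame MFCCs from contaminating the feature space.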
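The VQ classifier that the study finds most accurate can be sketched as one k-means codebook per enrolled speaker, with identification by minimum mean quantization distortion. Codebook size, iteration count, and the toy data here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def train_codebook(features, k=4, iters=20, seed=0):
    """Toy VQ training: plain k-means over one speaker's MFCC-like vectors."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every feature vector to its nearest codeword
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):               # skip empty cells
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

def identify(features, codebooks):
    """Return the index of the speaker whose codebook gives the lowest
    mean quantization distortion on the test vectors."""
    distortions = [
        np.linalg.norm(features[:, None, :] - cb[None, :, :], axis=2).min(axis=1).mean()
        for cb in codebooks
    ]
    return int(np.argmin(distortions))

# toy usage: two "speakers" with well-separated 13-dim feature clusters
rng = np.random.default_rng(1)
spk0 = rng.normal(0.0, 1.0, size=(200, 13))
spk1 = rng.normal(5.0, 1.0, size=(200, 13))
books = [train_codebook(spk0), train_codebook(spk1)]
print(identify(rng.normal(5.0, 1.0, size=(50, 13)), books))  # 1
```

Unlike the GMM-based classifiers, this distortion-based decision involves no covariance matrices, which is consistent with the abstract's observation that VQ avoids the singularity problem the GMM variants suffer from.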
doi_str_mv 10.1007/s10772-023-10024-1
format Article
fulltext fulltext
identifier ISSN: 1381-2416
ispartof International journal of speech technology, 2023-03, Vol.26 (1), p.211-220
issn 1381-2416
1572-8110
language eng
recordid cdi_proquest_journals_2791648039
source SpringerLink Journals - AutoHoldings
subjects Artificial Intelligence
B spline functions
Classifiers
Comparative studies
Covariance matrix
Empirical analysis
Engineering
Fourier analysis
Frames (data processing)
Performance evaluation
Polynomials
Probabilistic models
Recognition
Signal, Image and Speech Processing
Silence
Singularity (mathematics)
Social Sciences
Speaker identification
Speech recognition
Vector quantization
Voice activity detectors
Voice recognition
Window functions
title An empirical study on analysis window functions for text-independent speaker recognition