An empirical study on analysis window functions for text-independent speaker recognition


Bibliographic Details
Published in: International journal of speech technology 2023-03, Vol.26 (1), p.211-220
Main authors: Barai, Bidhan, Das, Nibaran, Basu, Subhadip, Nasipuri, Mita
Format: Article
Language: English
Subjects:
Online access: Full text
description This paper describes the effect of analysis window functions on the performance of Mel Frequency Cepstral Coefficient (MFCC) based speaker recognition (SR). The MFCCs of a speech signal are extracted from fixed-length frames using the Short-Time Fourier Analysis (STFA) technique, where an appropriate analysis window function is required to extract frames from the complete speech signal of a speaker prior to STFA. The number of frames is taken as the number of MFCC feature vectors of a speaker, which uniquely represent the speaker in the feature space (domain). For recognition, Vector Quantization (VQ), Gaussian Mixture Model (GMM), and Universal Background Model GMM (UBM-GMM) based classifiers are used, and a comparative study is made. In state-of-the-art MFCC feature vector extraction the Hamming window function (abbreviated as Ham in places in this paper) is generally used, but here we also examine the effect of other window functions in SR, such as the rectangular window, Hann window, B-spline windows, polynomial windows, adjustable windows, hybrid windows, and the Lanczos window. In the present paper, we briefly describe the analysis window functions and evaluate text-independent speaker identification (SI). We also use a voice activity detector (VAD) to discard silence frames before STFA. Indeed, removing silence frames improves SR performance, because the MFCCs of silent frames contaminate the MFCC feature space (MFCC with impurity). The IITG Multi-Variability (MV) SR database used here contains speech signals of speakers recorded by different devices, namely D01, H01, T01, M01, and M02, in different environments, languages, and sessions; this is why the database is called multi-variability. It is observed that the VQ classifier performs better than the GMM-based classifiers on this database, and that the classifiers VQ-GMM, VQ-UBM-GMM, and their combination suffer from the singularity problem of the covariance matrix.
We therefore evaluate the performance on device D01 for all the classifiers, while only three classifiers, namely GMM, UBM-GMM, and VQ, are used for the remaining four recording devices (H01, T01, M01, M02), because all other classifiers suffer from the covariance-matrix singularity problem in SI. It is observed that VQ provides the highest accuracy for all the devices.
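The frame extraction and windowing step described in the abstract can be sketched in NumPy as follows. This is a minimal illustration, not the paper's exact configuration: the 25 ms frame length, 10 ms hop, and the small window set shown here are illustrative assumptions.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160, window="hamming"):
    """Split a 1-D speech signal into fixed-length frames and apply an
    analysis window to each frame before short-time Fourier analysis."""
    windows = {
        "rectangular": np.ones(frame_len),
        "hamming": np.hamming(frame_len),
        "hann": np.hanning(frame_len),
    }
    w = windows[window]
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    return frames * w  # each row is one windowed frame

# toy usage: 1 s of noise at 16 kHz -> 400-sample (25 ms) frames, 160-sample (10 ms) hop
x = np.random.randn(16000)
frames = frame_signal(x, window="hann")
print(frames.shape)  # (98, 400)
```

Each row of the result would then be passed through the FFT and mel filterbank to produce one MFCC vector, so the number of frames equals the number of feature vectors per speaker.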
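The VAD-based silence-frame removal can be illustrated with a minimal energy threshold. This is an assumed heuristic stand-in; the abstract does not specify the internals of the VAD actually used.

```python
import numpy as np

def drop_silence(frames, ratio=0.01):
    """Discard frames whose short-time energy falls below a fraction of the
    peak frame energy -- a simple stand-in for a voice activity detector."""
    energy = (frames ** 2).sum(axis=1)            # per-frame energy
    return frames[energy > ratio * energy.max()]

# toy usage: mix loud "speech" frames with near-silent ones
speech = np.random.randn(10, 400)
silence = 1e-4 * np.random.randn(5, 400)
kept = drop_silence(np.vstack([speech, silence]))
print(kept.shape)  # (10, 400)
```

Only the surviving frames are sent to STFA, which is the step the abstract credits for keeping silent-frame MFCCs from contaminating the feature space.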
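The VQ classifier that the study finds most accurate can be sketched as one k-means codebook per enrolled speaker, with identification by minimum mean quantization distortion. Codebook size, iteration count, and the toy data here are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def train_codebook(features, k=4, iters=20, seed=0):
    """Toy VQ training: plain k-means over one speaker's MFCC-like vectors."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every feature vector to its nearest codeword
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):               # skip empty cells
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

def identify(features, codebooks):
    """Return the index of the speaker whose codebook gives the lowest
    mean quantization distortion on the test vectors."""
    distortions = [
        np.linalg.norm(features[:, None, :] - cb[None, :, :], axis=2).min(axis=1).mean()
        for cb in codebooks
    ]
    return int(np.argmin(distortions))

# toy usage: two "speakers" with well-separated 13-dim feature clusters
rng = np.random.default_rng(1)
spk0 = rng.normal(0.0, 1.0, size=(200, 13))
spk1 = rng.normal(5.0, 1.0, size=(200, 13))
books = [train_codebook(spk0), train_codebook(spk1)]
print(identify(rng.normal(5.0, 1.0, size=(50, 13)), books))  # 1
```

Unlike the GMM-based classifiers, this distortion-based decision involves no covariance matrices, which is consistent with the abstract's observation that VQ avoids the singularity problem the GMM variants suffer from.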
doi_str_mv 10.1007/s10772-023-10024-1
format Article
fulltext fulltext
identifier ISSN: 1381-2416
ispartof International journal of speech technology, 2023-03, Vol.26 (1), p.211-220
issn 1381-2416
1572-8110
language eng
recordid cdi_proquest_journals_2791648039
source SpringerLink Journals - AutoHoldings
subjects Artificial Intelligence
B spline functions
Classifiers
Comparative studies
Covariance matrix
Empirical analysis
Engineering
Fourier analysis
Frames (data processing)
Performance evaluation
Polynomials
Probabilistic models
Recognition
Signal, Image and Speech Processing
Silence
Singularity (mathematics)
Social Sciences
Speaker identification
Speech recognition
Vector quantization
Voice activity detectors
Voice recognition
Window functions
title An empirical study on analysis window functions for text-independent speaker recognition