An empirical study on analysis window functions for text-independent speaker recognition
Saved in:
Published in: | International journal of speech technology 2023-03, Vol.26 (1), p.211-220 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Full text |
container_end_page | 220 |
---|---|
container_issue | 1 |
container_start_page | 211 |
container_title | International journal of speech technology |
container_volume | 26 |
creator | Barai, Bidhan ; Das, Nibaran ; Basu, Subhadip ; Nasipuri, Mita |
description | This paper describes the effect of analysis window functions on the performance of Mel Frequency Cepstral Coefficient (MFCC) based speaker recognition (SR). The MFCCs of a speech signal are extracted from fixed-length frames using the Short Time Fourier Analysis (STFA) technique, where an appropriate analysis window function is required to extract frames from the complete speech signal of a speaker prior to STFA. The number of frames is taken as the number of MFCC feature vectors of a speaker, which uniquely represent the speaker in feature space (domain). For recognition, Vector Quantization (VQ) and/or Gaussian Mixture Model (GMM) and/or Universal Background Model GMM (UBM-GMM) based classifiers are used and a comparative study is made. In state-of-the-art MFCC feature extraction the Hamming window function (abbreviated as Ham in places in this paper) is generally used, but here we also examine the effect of other window functions, such as the rectangular window, Hann window, B-spline windows, polynomial windows, adjustable windows, hybrid windows and the Lanczos window, on SR. We briefly describe these analysis window functions and evaluate text-independent speaker identification (SI). We also use a voice activity detector (VAD) to discard silence frames before STFA; removing silence frames improves SR performance because the MFCCs of silent frames contaminate the MFCC feature space (MFCC with impurity). The IITG MV SR database contains speech signals of speakers recorded by different devices, namely D01, H01, T01, M01 and M02, in different environments, languages and sessions; this variability is why the database is called multi-variability. It is observed that the VQ classifier performs better than the GMM-based classifiers on this database, while the classifiers VQ-GMM, VQ-UBM-GMM and their combination suffer from the singularity problem of the covariance matrix. We therefore evaluate the performance on device D01 for all the classifiers, and only the three classifiers GMM, UBM-GMM and VQ are used for the remaining four recording devices, H01, T01, M01 and M02, because all other classifiers suffer from the singularity problem of the covariance matrix in SI. It is observed that VQ provides the highest accuracy for all the devices. |
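The front end described in the abstract (fixed-length framing, an analysis window applied per frame, then short-time Fourier analysis) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the frame length (400 samples, i.e. 25 ms at 16 kHz), the hop size (160 samples, 10 ms), and the function name `frame_and_window` are assumptions for the example, not values or names taken from the paper.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160, window="hamming"):
    """Split a 1-D speech signal into fixed-length frames and apply an
    analysis window to each frame before the FFT (the STFA step)."""
    windows = {
        "rectangular": np.ones(frame_len),   # no taper
        "hamming": np.hamming(frame_len),    # the usual default in MFCC pipelines
        "hann": np.hanning(frame_len),
    }
    w = windows[window]
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * w  # each row is one windowed frame

# Magnitude spectra of the windowed frames: this is what would feed the
# mel filter bank in a full MFCC extraction.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s synthetic tone
spectra = np.abs(np.fft.rfft(frame_and_window(sig), axis=1))
```

Swapping the `window` argument (e.g. `"hann"` or `"rectangular"`) changes the spectral leakage of each frame, which is exactly the effect on MFCC-based recognition that the paper studies across a wider family of windows.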
doi_str_mv | 10.1007/s10772-023-10024-1 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 1381-2416 |
ispartof | International journal of speech technology, 2023-03, Vol.26 (1), p.211-220 |
issn | 1381-2416 1572-8110 |
language | eng |
recordid | cdi_proquest_journals_2791648039 |
source | SpringerLink Journals - AutoHoldings |
subjects | Artificial Intelligence ; B spline functions ; Classifiers ; Comparative studies ; Covariance matrix ; Empirical analysis ; Engineering ; Fourier analysis ; Frames (data processing) ; Performance evaluation ; Polynomials ; Probabilistic models ; Recognition ; Signal, Image and Speech Processing ; Silence ; Singularity (mathematics) ; Social Sciences ; Speaker identification ; Speech recognition ; Vector quantization ; Voice activity detectors ; Voice recognition ; Window functions |
title | An empirical study on analysis window functions for text-independent speaker recognition |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-05T17%3A32%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20empirical%20study%20on%20analysis%20window%20functions%20for%20text-independent%20speaker%20recognition&rft.jtitle=International%20journal%20of%20speech%20technology&rft.au=Barai,%20Bidhan&rft.date=2023-03-01&rft.volume=26&rft.issue=1&rft.spage=211&rft.epage=220&rft.pages=211-220&rft.issn=1381-2416&rft.eissn=1572-8110&rft_id=info:doi/10.1007/s10772-023-10024-1&rft_dat=%3Cproquest_cross%3E2791648039%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2791648039&rft_id=info:pmid/&rfr_iscdi=true |