Short-Utterance-Based Children’s Speaker Verification in Low-Resource Conditions


Detailed Description

Saved in:
Bibliographic Details
Published in: Circuits, Systems, and Signal Processing, 2024-03, Vol. 43 (3), p. 1715-1740
Main authors: Aziz, Shahid; Ankita; Shahnawazuddin, S.
Format: Article
Language: English
Subjects:
Online Access: Full text
description The task of developing an automatic speaker verification (ASV) system for children is extremely challenging due to the unavailability of sufficiently large and free speech corpora from child speakers. On the other hand, hundreds of hours of speech data from adult speakers are freely available. Consequently, the majority of speaker verification works reported in the literature deal with adults' speech, while only a few works on children's speech have been published. The challenges in developing a robust ASV system for child speakers are further exacerbated when short utterances are used, a condition that remains largely unexplored for children's speech. In this paper, we therefore focus on children's speaker verification using short utterances. To deal with data scarcity, several out-of-domain data augmentation techniques are utilized. Since the out-of-domain data used in this study comes from adult speakers and is acoustically very different from children's speech, we resort to techniques such as prosody modification, formant modification, and voice conversion to render it acoustically similar to children's speech prior to augmentation. This helps not only in increasing the amount of training data, but also in effectively capturing the missing target attributes relevant to children's speech. A relative improvement of 33.57% in equal error rate (EER) over the baseline system trained solely on the child dataset demonstrates the effectiveness of the proposed data augmentation technique. Furthermore, we propose frame-level concatenation of Mel-frequency cepstral coefficients (MFCC) with frequency-domain linear prediction (FDLP) coefficients in order to simultaneously model the spectral as well as temporal envelopes. This frame-level concatenation is expected to further enhance the discrimination among speakers. The novel approach, when combined with data augmentation, further improves the performance of the speaker verification system. The experimental results support our claims: we achieve an overall relative reduction of 38.04% in equal error rate.
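The prosody-modification step can be pictured as shifting the pitch and speaking rate of adult recordings toward child-typical values before adding them to the training pool. The following is a minimal sketch using librosa, not the authors' exact pipeline; the +4 semitone shift and 0.85 stretch factor are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of prosody modification for augmentation
# (assumed parameters; not the paper's exact settings).
import librosa

def adult_to_childlike(path, n_steps=4.0, rate=0.85):
    y, sr = librosa.load(path, sr=16000)                        # load at 16 kHz
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)  # raise pitch toward the child range
    y = librosa.effects.time_stretch(y, rate=rate)              # rate < 1 slows the speaking rate
    return y, sr
```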
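Frame-level concatenation of MFCC and FDLP features amounts to stacking the two per-frame coefficient vectors into one longer vector. A sketch under stated assumptions: MFCCs come from librosa, while `fdlp_features` is a hypothetical stand-in for any FDLP extractor, since FDLP is not part of librosa.

```python
import numpy as np
import librosa

def fused_features(y, sr, fdlp_features):
    # 40-dimensional MFCCs with 25 ms windows and 10 ms hop
    # (common ASV front-end settings, assumed here).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))     # shape (40, T)
    fdlp = fdlp_features(y, sr)                                 # shape (D, T'), hypothetical helper
    T = min(mfcc.shape[1], fdlp.shape[1])                       # align frame counts
    return np.concatenate([mfcc[:, :T], fdlp[:, :T]], axis=0)  # shape (40 + D, T)
```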
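Equal error rate, the metric behind both reported numbers, is the operating point where the false-acceptance rate equals the false-rejection rate, and a relative reduction is (EER_baseline - EER_new) / EER_baseline. A generic way to estimate EER from verification trial scores (a sketch, not tied to the paper's toolkit):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # labels: 1 for target (same-speaker) trials, 0 for impostor trials;
    # scores: higher means more likely the same speaker.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr                       # false-rejection rate
    i = np.nanargmin(np.abs(fnr - fpr))   # threshold where FAR and FRR cross
    return (fpr[i] + fnr[i]) / 2.0

# Illustrative arithmetic (made-up numbers): a baseline EER of 10.00%
# dropping to 6.64% is a (10.00 - 6.64) / 10.00 = 33.6% relative reduction.
```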
doi_str_mv 10.1007/s00034-023-02535-8
format Article
fulltext fulltext
identifier ISSN: 0278-081X
ispartof Circuits, systems, and signal processing, 2024-03, Vol.43 (3), p.1715-1740
issn 0278-081X
1531-5878
language eng
recordid cdi_proquest_journals_2931850918
source Springer Nature - Complete Springer Journals
subjects Availability
Children
Children & youth
Circuits and Systems
Data augmentation
Electrical Engineering
Electronics and Microelectronics
Engineering
Error analysis
Instrumentation
Linear prediction
Linguistics
Prosody
Signal, Image and Speech Processing
Speaker identification
Speech
Verification
Voice recognition
title Short-Utterance-Based Children’s Speaker Verification in Low-Resource Conditions