Enhancing Children’s Short Utterance-Based ASV Using Inverse Gamma-tone Filtered Cepstral coefficients


Bibliographic Details

Published in: Circuits, Systems, and Signal Processing, 2024-05, Vol. 43 (5), p. 3020-3041
Main authors: Aziz, Shahid; Shahnawazuddin, S.
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page 3041
container_issue 5
container_start_page 3020
container_title Circuits, systems, and signal processing
container_volume 43
creator Aziz, Shahid
Shahnawazuddin, S.
description The task of developing an automatic speaker verification (ASV) system for children’s speech is extremely challenging due to the dearth of domain-specific data. The challenges are further exacerbated in the case of short utterances, a relatively unexplored domain for children’s ASV. Voice-based biometric systems require an adequate amount of speech data for enrollment and verification; otherwise, performance degrades considerably. For this reason, the trade-off between convenience and security is difficult to maintain in practical scenarios. In this paper, we focus on data paucity and on preserving the higher-frequency content of speech in order to enhance the performance of a short-utterance-based children’s speaker verification system. To deal with data scarcity, an out-of-domain data augmentation approach is proposed. Since the out-of-domain data come from adult speakers, who are acoustically very different from children, we employ techniques such as prosody modification, formant modification, and voice conversion to render the data acoustically similar to children’s speech prior to augmentation. This not only increases the amount of training data but also captures the missing target attributes, which boosts verification performance. In addition, we concatenate the classical Mel-frequency cepstral coefficient (MFCC) features with Gamma-tone frequency cepstral coefficient (GTF-CC) or Inverse Gamma-tone frequency cepstral coefficient (IGTF-CC) features. The concatenation of MFCC and IGTF-CC is intended to model the human auditory system effectively while preserving the higher-frequency content of children’s speech. This feature concatenation approach, when combined with data augmentation, further improves verification performance.
The experimental results support these claims: we achieve an overall relative reduction of 38.5% in equal error rate.
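The gammatone-based cepstral features and the MFCC/IGTF-CC concatenation described in the abstract can be sketched as follows. This is a minimal NumPy/SciPy illustration, not the authors' implementation: the filter shapes, the ERB spacing, the crude mel filterbank, and in particular the "inverse" variant (centre frequencies mirrored toward high frequencies so that high-frequency content is finely resolved) are assumptions made for demonstration only.

```python
import numpy as np
from scipy.fft import dct, rfft

def hz_to_erb_rate(f):
    # Glasberg & Moore ERB-rate scale
    return 21.4 * np.log10(4.37e-3 * f + 1.0)

def erb_rate_to_hz(e):
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def gammatone_weights(n_filters, n_fft, fs, invert=False):
    """Magnitude responses of 4th-order gammatone filters on the rFFT grid.
    invert=True mirrors the centre frequencies so the bank is dense at HIGH
    frequencies -- a stand-in for the paper's inverse-gammatone idea."""
    freqs = np.linspace(0.0, fs / 2.0, n_fft // 2 + 1)
    lo, hi = hz_to_erb_rate(50.0), hz_to_erb_rate(fs / 2.0 - 100.0)
    cfs = erb_rate_to_hz(np.linspace(lo, hi, n_filters))
    if invert:
        cfs = (fs / 2.0 + 50.0) - cfs[::-1]   # mirror toward high frequencies
    erb = 24.7 * (4.37e-3 * cfs + 1.0)        # equivalent rectangular bandwidth
    W = (1.0 + ((freqs[None, :] - cfs[:, None]) / (1.019 * erb[:, None])) ** 2) ** -2
    return W / W.sum(axis=1, keepdims=True)

def mel_weights(n_filters, n_fft, fs):
    """Crude mel-spaced filterbank (smooth bumps rather than textbook triangles)."""
    freqs = np.linspace(0.0, fs / 2.0, n_fft // 2 + 1)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    cfs = imel(np.linspace(mel(50.0), mel(fs / 2.0 - 100.0), n_filters))
    bw = 0.2 * cfs + 50.0
    W = np.exp(-0.5 * ((freqs[None, :] - cfs[:, None]) / bw[:, None]) ** 2)
    return W / W.sum(axis=1, keepdims=True)

def cepstra(signal, weights, n_ceps=13, frame=400, hop=160):
    """Frame -> window -> power spectrum -> filterbank -> log -> DCT."""
    frames = np.array([signal[s:s + frame] * np.hamming(frame)
                       for s in range(0, len(signal) - frame + 1, hop)])
    power = np.abs(rfft(frames, n=frame)) ** 2
    return dct(np.log(power @ weights.T + 1e-10), type=2, axis=1, norm='ortho')[:, :n_ceps]

fs = 16000
sig = np.random.default_rng(0).standard_normal(fs)      # 1 s of noise as a stand-in
mfcc   = cepstra(sig, mel_weights(26, 400, fs))         # MFCC-like features
igtfcc = cepstra(sig, gammatone_weights(26, 400, fs, invert=True))
features = np.concatenate([mfcc, igtfcc], axis=1)       # per-frame concatenation
print(features.shape)                                   # -> (98, 26)
```

The concatenated `features` matrix is what a backend (e.g. an x-vector or i-vector system) would consume per frame; the paper's point is that the inverse-gammatone half of the vector keeps the high-frequency detail that child speech carries and that mel-spaced filters smear.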
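The reported figure is a relative, not absolute, reduction in equal error rate (EER). Below is a minimal sketch of how an EER is computed from verification scores and how a relative reduction is derived; the score distributions and the baseline/improved EER values are synthetic, chosen only to illustrate the arithmetic behind a 38.5% figure.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep a decision threshold over all observed scores; the EER is the
    operating point where false-acceptance and false-rejection rates cross."""
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # impostor trials wrongly accepted
        frr = np.mean(genuine < t)     # genuine trials wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

rng = np.random.default_rng(1)
genuine  = rng.normal(2.0, 1.0, 2000)   # synthetic target-trial scores
impostor = rng.normal(0.0, 1.0, 2000)   # synthetic non-target-trial scores
eer = equal_error_rate(genuine, impostor)

# Relative reduction, with hypothetical baseline/improved EERs for illustration:
baseline_eer, improved_eer = 0.200, 0.123
relative_reduction = 100.0 * (baseline_eer - improved_eer) / baseline_eer
print(round(relative_reduction, 1))     # -> 38.5
```

Note the distinction: a relative reduction of 38.5% from a 20% baseline lands at 12.3% EER, whereas an absolute reduction of 38.5 points would be impossible from that baseline.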
doi_str_mv 10.1007/s00034-023-02592-z
format Article
fulltext fulltext
identifier ISSN: 0278-081X
ispartof Circuits, systems, and signal processing, 2024-05, Vol.43 (5), p.3020-3041
issn 0278-081X
1531-5878
language eng
recordid cdi_proquest_journals_3020236159
source SpringerLink Journals
subjects Auditory system
Children
Circuits and Systems
Coefficients
Data augmentation
Electrical Engineering
Electronics and Microelectronics
Engineering
Error analysis
Instrumentation
Linguistics
Performance degradation
Prosody
Signal, Image and Speech Processing
Speaker identification
Speech
Tone
Verification
Voice recognition
title Enhancing Children’s Short Utterance-Based ASV Using Inverse Gamma-tone Filtered Cepstral coefficients
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T01%3A37%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Enhancing%20Children%E2%80%99s%20Short%20Utterance-Based%20ASV%20Using%20Inverse%20Gamma-tone%20Filtered%20Cepstral%20coefficients&rft.jtitle=Circuits,%20systems,%20and%20signal%20processing&rft.au=Aziz,%20Shahid&rft.date=2024-05-01&rft.volume=43&rft.issue=5&rft.spage=3020&rft.epage=3041&rft.pages=3020-3041&rft.issn=0278-081X&rft.eissn=1531-5878&rft_id=info:doi/10.1007/s00034-023-02592-z&rft_dat=%3Cproquest_cross%3E3020236159%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3020236159&rft_id=info:pmid/&rfr_iscdi=true