Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis

In the field of text-independent speaker recognition, dynamic models that adapt along the time axis have been proposed to consider the phoneme-varying characteristics of speech. However, a detailed analysis of how dynamic models work depending on phonemes is insufficient. In this paper, we propose temporal dynamic CNN (TDY-CNN), which considers the temporal variation of phonemes by applying kernels that optimally adapt to each time bin.

Detailed Description

Saved in:
Bibliographic Details
Main authors: Kim, Seong-Hu, Nam, Hyeonuk, Park, Yong-Hwa
Format: Article
Language: English
Online access: Order full text
creator Kim, Seong-Hu; Nam, Hyeonuk; Park, Yong-Hwa
description In the field of text-independent speaker recognition, dynamic models that adapt along the time axis have been proposed to consider the phoneme-varying characteristics of speech. However, a detailed analysis of how dynamic models work depending on phonemes is insufficient. In this paper, we propose temporal dynamic CNN (TDY-CNN), which considers the temporal variation of phonemes by applying kernels that optimally adapt to each time bin. These kernels adapt to time bins through a weighted sum of trained basis kernels. We then analyze how the adaptive kernels act on different phonemes across layers. TDY-ResNet-38(x0.5) using six basis kernels improved the equal error rate (EER), a measure of speaker verification performance, by 17.3% compared to the baseline model ResNet-38(x0.5). In addition, we showed that adaptive kernels depend on phoneme groups and are more phoneme-specific at early layers. The temporal dynamic model adapts itself to phonemes without explicitly given phoneme information during training, and the results show the necessity of considering phoneme variation within utterances for more accurate and robust text-independent speaker verification.
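The core mechanism the description names — a kernel per time bin formed as an attention-weighted sum of trained basis kernels — can be sketched in NumPy. This is a minimal illustration under assumed shapes and a toy per-time-bin attention; the function and variable names are hypothetical and this is not the authors' implementation.

```python
# Sketch of a temporal dynamic convolution: each time bin of the input
# gets its own kernel, built as a softmax-weighted sum of K basis kernels.
# Shapes and the attention projection are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_dynamic_conv(x, basis, attn_w):
    """x: (freq, time) spectrogram patch.
    basis: (K, k) -- K trainable basis kernels of length k (1-D along freq for brevity).
    attn_w: (K, freq) -- toy attention projection yielding per-time-bin weights.
    Returns (freq - k + 1, time): each time bin convolved with its own kernel."""
    K, k = basis.shape
    F, T = x.shape
    out = np.empty((F - k + 1, T))
    for t in range(T):
        pi = softmax(attn_w @ x[:, t])   # attention over the K basis kernels, shape (K,)
        kernel = pi @ basis              # weighted sum of basis kernels, shape (k,)
        # valid correlation along the frequency axis (kernel reversed for np.convolve)
        out[:, t] = np.convolve(x[:, t], kernel[::-1], mode="valid")
    return out

x = rng.standard_normal((40, 10))    # 40 frequency bins, 10 time bins
basis = rng.standard_normal((6, 3))  # six basis kernels, matching TDY-ResNet-38(x0.5)
attn = rng.standard_normal((6, 40))
y = temporal_dynamic_conv(x, basis, attn)
print(y.shape)  # -> (38, 10)
```

Because the attention weights depend on the input at each time bin, the effective kernel varies over time without any explicit phoneme labels, which is the property the paper's phoneme analysis probes.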
doi_str_mv 10.48550/arxiv.2110.03213
format Article
fullrecord (raw Primo index XML omitted; recoverable details follow)
creationdate 2021-10-07
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
oa free_for_read
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2110.03213
language eng
recordid cdi_arxiv_primary_2110_03213
source arXiv.org
title Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis
url https://arxiv.org/abs/2110.03213