VoxWatch: An open-set speaker recognition benchmark on VoxCeleb
Despite its broad practical applications such as in fraud prevention, open-set speaker identification (OSI) has received less attention in the speaker recognition community compared to speaker verification (SV). OSI deals with determining if a test speech sample belongs to a speaker from a set of pr...
Main authors: | Peri, Raghuveer; Sadjadi, Seyed Omid; Garcia-Romero, Daniel |
---|---|
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Peri, Raghuveer; Sadjadi, Seyed Omid; Garcia-Romero, Daniel |
description | Despite its broad practical applications such as in fraud prevention,
open-set speaker identification (OSI) has received less attention in the
speaker recognition community compared to speaker verification (SV). OSI deals
with determining if a test speech sample belongs to a speaker from a set of
pre-enrolled individuals (in-set) or if it is from an out-of-set speaker. In
addition to the typical challenges associated with speech variability, OSI is
prone to the "false-alarm problem"; as the size of the in-set speaker
population (a.k.a. watchlist) grows, the out-of-set scores become larger,
leading to increased false alarm rates. This is particularly challenging for
applications in financial institutions and border security where the watchlist
size is typically of the order of several thousand speakers. Therefore, it is
important to systematically quantify the false-alarm problem, and develop
techniques that alleviate the impact of watchlist size on detection
performance. Prior studies on this problem are sparse, and lack a common
benchmark for systematic evaluations. In this paper, we present the first
public benchmark for OSI, developed using the VoxCeleb dataset. We quantify the
effect of the watchlist size and speech duration on the watchlist-based speaker
detection task using three strong neural-network-based systems. In contrast to
the findings from prior research, we show that the commonly adopted adaptive
score normalization is not guaranteed to improve the performance for this task.
On the other hand, we show that score calibration and score fusion, two other
commonly used techniques in SV, result in significant improvements in OSI
performance. |
doi_str_mv | 10.48550/arxiv.2307.00169 |
format | Article |
creationdate | 2023-06-30 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
oa | free_for_read |
linktorsrc | https://arxiv.org/abs/2307.00169 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2307.00169 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2307_00169 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Learning |
title | VoxWatch: An open-set speaker recognition benchmark on VoxCeleb |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-19T02%3A59%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=VoxWatch:%20An%20open-set%20speaker%20recognition%20benchmark%20on%20VoxCeleb&rft.au=Peri,%20Raghuveer&rft.date=2023-06-30&rft_id=info:doi/10.48550/arxiv.2307.00169&rft_dat=%3Carxiv_GOX%3E2307_00169%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
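
The "false-alarm problem" described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' benchmark or scoring pipeline; it assumes, purely for illustration, that an out-of-set test sample produces an independent standard-normal score against each enrolled speaker and that an alarm is raised when the highest score exceeds a fixed threshold. The function names `false_alarm_rate` and `analytic_false_alarm_rate` are hypothetical.

```python
import math
import random


def false_alarm_rate(watchlist_size, threshold, trials=2000, seed=0):
    """Monte Carlo estimate of the false alarm rate for out-of-set trials.

    Assumption (not from the paper): each out-of-set sample scores
    against every enrolled speaker with an independent N(0, 1) score,
    and the system raises an alarm when the maximum score over the
    watchlist exceeds the threshold.
    """
    rng = random.Random(seed)
    alarms = 0
    for _ in range(trials):
        top_score = max(rng.gauss(0.0, 1.0) for _ in range(watchlist_size))
        if top_score > threshold:
            alarms += 1
    return alarms / trials


def analytic_false_alarm_rate(watchlist_size, threshold):
    """Closed form under the same assumption: 1 - Phi(threshold) ** N."""
    phi = 0.5 * (1.0 + math.erf(threshold / math.sqrt(2.0)))
    return 1.0 - phi ** watchlist_size


if __name__ == "__main__":
    # Fixed threshold, growing watchlist: the false alarm rate rises.
    for size in (1, 10, 100, 1000):
        simulated = false_alarm_rate(size, threshold=3.0, trials=500)
        exact = analytic_false_alarm_rate(size, threshold=3.0)
        print(f"watchlist={size:5d}  simulated={simulated:.3f}  analytic={exact:.3f}")
```

Under this toy model a threshold that almost never fires for a single enrolled speaker fires often once the watchlist holds a thousand speakers, which is the effect the paper sets out to quantify on real VoxCeleb scores.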