Unsupervised Speaker Identification in TV Broadcast Based on Written Names

Identifying speakers in TV broadcast in an unsupervised way (i.e., without biometric models) is a solution for avoiding costly annotations. Existing methods usually use pronounced names as a source of names for identifying speech clusters provided by a diarization step, but this source is too imprecise to provide sufficient confidence. To overcome this issue, another source of names can be used: the names written in a title block in the image track. We first compared these two sources of names on their ability to provide the names of the speakers in TV broadcast. This study shows that written names are more useful thanks to their high precision for identifying the current speaker. We also propose two approaches for finding speaker identity based only on names written in the image track. With the "late naming" approach, we propose different propagations of written names onto clusters. Our second proposition, "early naming," modifies the speaker diarization module (agglomerative clustering) by adding constraints that prevent two clusters with different associated written names from being merged. These methods were tested on the REPERE corpus phase 1, containing 3 hours of annotated videos. Our best "late naming" system reaches an F-measure of 73.1%; "early naming" improves on this result both in terms of identification error rate and of stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only reached a 57.2% F-measure.

Bibliographic Details
Published in: IEEE Transactions on Audio, Speech, and Language Processing, 2015-01, Vol. 23 (1), p. 57-68
Main authors: Poignant, Johann, Besacier, Laurent, Quénot, Georges
Format: Article
Language: English
Subjects:
Online access: Order full text
container_end_page 68
container_issue 1
container_start_page 57
container_title IEEE transactions on audio, speech, and language processing
container_volume 23
creator Poignant, Johann
Besacier, Laurent
Quénot, Georges
description Identifying speakers in TV broadcast in an unsupervised way (i.e., without biometric models) is a solution for avoiding costly annotations. Existing methods usually use pronounced names as a source of names for identifying speech clusters provided by a diarization step, but this source is too imprecise to provide sufficient confidence. To overcome this issue, another source of names can be used: the names written in a title block in the image track. We first compared these two sources of names on their ability to provide the names of the speakers in TV broadcast. This study shows that written names are more useful thanks to their high precision for identifying the current speaker. We also propose two approaches for finding speaker identity based only on names written in the image track. With the "late naming" approach, we propose different propagations of written names onto clusters. Our second proposition, "early naming," modifies the speaker diarization module (agglomerative clustering) by adding constraints that prevent two clusters with different associated written names from being merged. These methods were tested on the REPERE corpus phase 1, containing 3 hours of annotated videos. Our best "late naming" system reaches an F-measure of 73.1%; "early naming" improves on this result both in terms of identification error rate and of stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only reached a 57.2% F-measure.
doi_str_mv 10.1109/TASLP.2014.2367822
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2329-9290
ispartof IEEE transactions on audio, speech, and language processing, 2015-01, Vol.23 (1), p.57-68
issn 2329-9290
1558-7916
2329-9304
language eng
recordid cdi_ieee_primary_6949118
source IEEE Electronic Library (IEL)
subjects Computation and Language
Computer Science
Document and Text Processing
Error analysis
IEEE transactions
Manuals
Multimodal fusion
speaker diarization
speaker identification
Speech
Speech processing
TV broadcast
TV broadcasting
Videos
written names
title Unsupervised Speaker Identification in TV Broadcast Based on Written Names
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T06%3A45%3A18IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-hal_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unsupervised%20Speaker%20Identification%20in%20TV%20Broadcast%20Based%20on%20Written%20Names&rft.jtitle=IEEE%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Poignant,%20Johann&rft.date=2015-01-01&rft.volume=23&rft.issue=1&rft.spage=57&rft.epage=68&rft.pages=57-68&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASD8&rft_id=info:doi/10.1109/TASLP.2014.2367822&rft_dat=%3Chal_RIE%3Eoai_HAL_hal_01060827v1%3C/hal_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=6949118&rfr_iscdi=true