Segregation of Speakers for Speaker Adaptation in TV News Audio

Speaker adaptation is commonly used to compensate for speaker variation in large-vocabulary continuous speech recognition. In a multi-speaker environment where speakers change frequently, speaker segregation is needed to divide the input audio stream into speaker turns. Speaker turns define the current spe...

Bibliographic Details
Main Authors:	Remes, U., Pylkkonen, J., Kurimo, M.
Format:	Conference Proceeding
Language:	eng
Subjects:	Covariance matrix ; Informatics ; Maximum likelihood decoding ; Maximum likelihood linear regression ; Shape measurement ; speaker recognition ; Speech recognition ; Streaming media ; Testing ; Vocabulary

container_end_page	IV-484
container_issue
container_start_page	IV-481
container_title
container_volume	4
creator	Remes, U. ; Pylkkonen, J. ; Kurimo, M.
description	Speaker adaptation is commonly used to compensate for speaker variation in large-vocabulary continuous speech recognition. In a multi-speaker environment where speakers change frequently, speaker segregation is needed to divide the input audio stream into speaker turns. Speaker turns define the current speaker at each point in time, so speaker adaptation can be performed on the basis of speaker turns. The novelty of this paper is that the speaker-specific transformations are estimated incrementally and in tandem with speaker segregation. We therefore need a transformation that can be reliably estimated from a single speaker turn. We propose constrained maximum likelihood linear regression (CMLLR) for this purpose. In tests with Finnish TV news audio, speaker adaptation reduced the average letter error rate by 25% relative to the baseline.
doi_str_mv	10.1109/ICASSP.2007.366954
format	Conference Proceeding
fullrecord	<record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_4218142</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>4218142</ieee_id><sourcerecordid>4218142</sourcerecordid><originalsourceid>FETCH-LOGICAL-i219t-db8efd3d39f01e3bd5a65cbd8236b51da8c574464f347411d07ec280401049e03</originalsourceid><addsrcrecordid>eNpVj1tLw0AUhNcbWGv_gL7sH0g8Z_dkL08SijcoKqSKb2WTPSnx0oQkIv57leqDT8MwH8OMECcIKSL4s5t5XhT3qQKwqTbGZ7QjZt46JEUEVjmzKyZKW5-gh6e9f5n1-2KCmYLEIPlDcTQMzwDgLLmJOC943fM6jE27kW0ti47DC_eDrNv-z8g8hm7cIs1GLh_lLX8MMn-PTXssDurwOvDsV6fi4fJiOb9OFndX36MXSaPQj0ksHddRR-1rQNZlzILJqjI6pU2ZYQyuyiyRoVqTJcQIlivlgACBPIOeitNtb8PMq65v3kL_uSKFPzf1F-qJTSw</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Segregation of Speakers for Speaker Adaptation in TV News Audio</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Remes, U. ; Pylkkonen, J. ; Kurimo, M.</creator><creatorcontrib>Remes, U. ; Pylkkonen, J. ; Kurimo, M.</creatorcontrib><description>Speaker adaptation is commonly used to compensate speaker variation in large vocabulary continuous speech recognition. In a multi-speaker environment where speakers change frequently speaker segregation is needed to divide the input audio stream to speaker turns. Speaker turns define the current speaker at each time and speaker adaptation can thus be done based on speaker turns. The novelty of this paper is that the speaker-specific transformations are estimated incrementally and in tandem with speaker segregation. Therefore we need a transformation that can be reliably estimated based on one speaker turn alone. We propose the constrained maximum likelihood linear regression (CMLLR) for this. 
In testing with Finnish TV news audio, speaker adaptation reduced the average letter error rate 25% relative to baseline.</description><identifier>ISSN: 1520-6149</identifier><identifier>ISBN: 9781424407279</identifier><identifier>ISBN: 1424407273</identifier><identifier>EISSN: 2379-190X</identifier><identifier>EISBN: 9781424407286</identifier><identifier>EISBN: 1424407281</identifier><identifier>DOI: 10.1109/ICASSP.2007.366954</identifier><language>eng</language><publisher>IEEE</publisher><subject>Covariance matrix ; Informatics ; Maximum likelihood decoding ; Maximum likelihood linear regression ; Shape measurement ; speaker recognition ; Speech recognition ; Streaming media ; Testing ; Vocabulary</subject><ispartof>2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007, Vol.4, p.IV-481-IV-484</ispartof><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/4218142$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,780,784,789,790,2058,27925,54920</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/4218142$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Remes, U.</creatorcontrib><creatorcontrib>Pylkkonen, J.</creatorcontrib><creatorcontrib>Kurimo, M.</creatorcontrib><title>Segregation of Speakers for Speaker Adaptation in TV News Audio</title><title>2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07</title><addtitle>ICASSP</addtitle><description>Speaker adaptation is commonly used to compensate speaker variation in large vocabulary continuous speech recognition. 
In a multi-speaker environment where speakers change frequently speaker segregation is needed to divide the input audio stream to speaker turns. Speaker turns define the current speaker at each time and speaker adaptation can thus be done based on speaker turns. The novelty of this paper is that the speaker-specific transformations are estimated incrementally and in tandem with speaker segregation. Therefore we need a transformation that can be reliably estimated based on one speaker turn alone. We propose the constrained maximum likelihood linear regression (CMLLR) for this. In testing with Finnish TV news audio, speaker adaptation reduced the average letter error rate 25% relative to baseline.</description><subject>Covariance matrix</subject><subject>Informatics</subject><subject>Maximum likelihood decoding</subject><subject>Maximum likelihood linear regression</subject><subject>Shape measurement</subject><subject>speaker recognition</subject><subject>Speech recognition</subject><subject>Streaming media</subject><subject>Testing</subject><subject>Vocabulary</subject><issn>1520-6149</issn><issn>2379-190X</issn><isbn>9781424407279</isbn><isbn>1424407273</isbn><isbn>9781424407286</isbn><isbn>1424407281</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2007</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNpVj1tLw0AUhNcbWGv_gL7sH0g8Z_dkL08SijcoKqSKb2WTPSnx0oQkIv57leqDT8MwH8OMECcIKSL4s5t5XhT3qQKwqTbGZ7QjZt46JEUEVjmzKyZKW5-gh6e9f5n1-2KCmYLEIPlDcTQMzwDgLLmJOC943fM6jE27kW0ti47DC_eDrNv-z8g8hm7cIs1GLh_lLX8MMn-PTXssDurwOvDsV6fi4fJiOb9OFndX36MXSaPQj0ksHddRR-1rQNZlzILJqjI6pU2ZYQyuyiyRoVqTJcQIlivlgACBPIOeitNtb8PMq65v3kL_uSKFPzf1F-qJTSw</recordid><startdate>200704</startdate><enddate>200704</enddate><creator>Remes, U.</creator><creator>Pylkkonen, J.</creator><creator>Kurimo, 
M.</creator><general>IEEE</general><scope>6IE</scope><scope>6IH</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIO</scope></search><sort><creationdate>200704</creationdate><title>Segregation of Speakers for Speaker Adaptation in TV News Audio</title><author>Remes, U. ; Pylkkonen, J. ; Kurimo, M.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i219t-db8efd3d39f01e3bd5a65cbd8236b51da8c574464f347411d07ec280401049e03</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2007</creationdate><topic>Covariance matrix</topic><topic>Informatics</topic><topic>Maximum likelihood decoding</topic><topic>Maximum likelihood linear regression</topic><topic>Shape measurement</topic><topic>speaker recognition</topic><topic>Speech recognition</topic><topic>Streaming media</topic><topic>Testing</topic><topic>Vocabulary</topic><toplevel>online_resources</toplevel><creatorcontrib>Remes, U.</creatorcontrib><creatorcontrib>Pylkkonen, J.</creatorcontrib><creatorcontrib>Kurimo, M.</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan (POP) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP) 1998-present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Remes, U.</au><au>Pylkkonen, J.</au><au>Kurimo, M.</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Segregation of Speakers for Speaker Adaptation in TV News Audio</atitle><btitle>2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP 
'07</btitle><stitle>ICASSP</stitle><date>2007-04</date><risdate>2007</risdate><volume>4</volume><spage>IV-481</spage><epage>IV-484</epage><pages>IV-481-IV-484</pages><issn>1520-6149</issn><eissn>2379-190X</eissn><isbn>9781424407279</isbn><isbn>1424407273</isbn><eisbn>9781424407286</eisbn><eisbn>1424407281</eisbn><abstract>Speaker adaptation is commonly used to compensate speaker variation in large vocabulary continuous speech recognition. In a multi-speaker environment where speakers change frequently speaker segregation is needed to divide the input audio stream to speaker turns. Speaker turns define the current speaker at each time and speaker adaptation can thus be done based on speaker turns. The novelty of this paper is that the speaker-specific transformations are estimated incrementally and in tandem with speaker segregation. Therefore we need a transformation that can be reliably estimated based on one speaker turn alone. We propose the constrained maximum likelihood linear regression (CMLLR) for this. In testing with Finnish TV news audio, speaker adaptation reduced the average letter error rate 25% relative to baseline.</abstract><pub>IEEE</pub><doi>10.1109/ICASSP.2007.366954</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1520-6149
ispartof	2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007, Vol.4, p.IV-481-IV-484
issn	1520-6149 2379-190X
language	eng
recordid	cdi_ieee_primary_4218142
source	IEEE Electronic Library (IEL) Conference Proceedings
subjects	Covariance matrix ; Informatics ; Maximum likelihood decoding ; Maximum likelihood linear regression ; Shape measurement ; speaker recognition ; Speech recognition ; Streaming media ; Testing ; Vocabulary
title	Segregation of Speakers for Speaker Adaptation in TV News Audio
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-20T10%3A04%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Segregation%20of%20Speakers%20for%20Speaker%20Adaptation%20in%20TV%20News%20Audio&rft.btitle=2007%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20-%20ICASSP%20'07&rft.au=Remes,%20U.&rft.date=2007-04&rft.volume=4&rft.spage=IV-481&rft.epage=IV-484&rft.pages=IV-481-IV-484&rft.issn=1520-6149&rft.eissn=2379-190X&rft.isbn=9781424407279&rft.isbn_list=1424407273&rft_id=info:doi/10.1109/ICASSP.2007.366954&rft_dat=%3Cieee_6IE%3E4218142%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=9781424407286&rft.eisbn_list=1424407281&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=4218142&rfr_iscdi=true