Segregation of Speakers for Speaker Adaptation in TV News Audio

Speaker adaptation is commonly used to compensate for speaker variation in large vocabulary continuous speech recognition. In a multi-speaker environment where speakers change frequently, speaker segregation is needed to divide the input audio stream into speaker turns. Speaker turns define the current spe...

Bibliographic details
Main authors: Remes, U., Pylkkonen, J., Kurimo, M.
Format: Conference Proceeding
Language: eng
container_end_page IV-484
container_issue
container_start_page IV-481
container_title
container_volume 4
creator Remes, U.
Pylkkonen, J.
Kurimo, M.
description Speaker adaptation is commonly used to compensate for speaker variation in large vocabulary continuous speech recognition. In a multi-speaker environment where speakers change frequently, speaker segregation is needed to divide the input audio stream into speaker turns. Speaker turns define the current speaker at each time, and speaker adaptation can thus be based on speaker turns. The novelty of this paper is that the speaker-specific transformations are estimated incrementally and in tandem with speaker segregation. Therefore, we need a transformation that can be reliably estimated from a single speaker turn. We propose constrained maximum likelihood linear regression (CMLLR) for this purpose. In tests with Finnish TV news audio, speaker adaptation reduced the average letter error rate by 25% relative to the baseline.
doi_str_mv 10.1109/ICASSP.2007.366954
format Conference Proceeding
fulltext fulltext_linktorsrc
identifier ISSN: 1520-6149
ispartof 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007, Vol.4, p.IV-481-IV-484
issn 1520-6149
2379-190X
language eng
recordid cdi_ieee_primary_4218142
source IEEE Electronic Library (IEL) Conference Proceedings
subjects Covariance matrix
Informatics
Maximum likelihood decoding
Maximum likelihood linear regression
Shape measurement
speaker recognition
Speech recognition
Streaming media
Testing
Vocabulary
title Segregation of Speakers for Speaker Adaptation in TV News Audio
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-20T10%3A04%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Segregation%20of%20Speakers%20for%20Speaker%20Adaptation%20in%20TV%20News%20Audio&rft.btitle=2007%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20-%20ICASSP%20'07&rft.au=Remes,%20U.&rft.date=2007-04&rft.volume=4&rft.spage=IV-481&rft.epage=IV-484&rft.pages=IV-481-IV-484&rft.issn=1520-6149&rft.eissn=2379-190X&rft.isbn=9781424407279&rft.isbn_list=1424407273&rft_id=info:doi/10.1109/ICASSP.2007.366954&rft_dat=%3Cieee_6IE%3E4218142%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=9781424407286&rft.eisbn_list=1424407281&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=4218142&rfr_iscdi=true