Segregation of Speakers for Speaker Adaptation in TV News Audio
Speaker adaptation is commonly used to compensate for speaker variation in large-vocabulary continuous speech recognition. In a multi-speaker environment where speakers change frequently, speaker segregation is needed to divide the input audio stream into speaker turns. Speaker turns define the current spe...
Main authors: | Remes, U.; Pylkkonen, J.; Kurimo, M. |
---|---|
Format: | Conference proceeding |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | IV-484 |
---|---|
container_issue | |
container_start_page | IV-481 |
container_title | |
container_volume | 4 |
creator | Remes, U.; Pylkkonen, J.; Kurimo, M. |
description | Speaker adaptation is commonly used to compensate for speaker variation in large-vocabulary continuous speech recognition. In a multi-speaker environment where speakers change frequently, speaker segregation is needed to divide the input audio stream into speaker turns. Speaker turns identify the current speaker at each point in time, so speaker adaptation can be based on them. The novelty of this paper is that the speaker-specific transformations are estimated incrementally and in tandem with speaker segregation. We therefore need a transformation that can be reliably estimated from a single speaker turn, and we propose constrained maximum likelihood linear regression (CMLLR) for this purpose. In tests with Finnish TV news audio, speaker adaptation reduced the average letter error rate by 25% relative to the baseline. |
doi_str_mv | 10.1109/ICASSP.2007.366954 |
format | Conference Proceeding |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1520-6149 |
ispartof | 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007, Vol.4, p.IV-481-IV-484 |
issn | 1520-6149; 2379-190X |
language | eng |
recordid | cdi_ieee_primary_4218142 |
source | IEEE Electronic Library (IEL) Conference Proceedings |
subjects | Covariance matrix; Informatics; Maximum likelihood decoding; Maximum likelihood linear regression; Shape measurement; speaker recognition; Speech recognition; Streaming media; Testing; Vocabulary |
title | Segregation of Speakers for Speaker Adaptation in TV News Audio |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-20T10%3A04%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Segregation%20of%20Speakers%20for%20Speaker%20Adaptation%20in%20TV%20News%20Audio&rft.btitle=2007%20IEEE%20International%20Conference%20on%20Acoustics,%20Speech%20and%20Signal%20Processing%20-%20ICASSP%20'07&rft.au=Remes,%20U.&rft.date=2007-04&rft.volume=4&rft.spage=IV-481&rft.epage=IV-484&rft.pages=IV-481-IV-484&rft.issn=1520-6149&rft.eissn=2379-190X&rft.isbn=9781424407279&rft.isbn_list=1424407273&rft_id=info:doi/10.1109/ICASSP.2007.366954&rft_dat=%3Cieee_6IE%3E4218142%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&rft.eisbn=9781424407286&rft.eisbn_list=1424407281&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=4218142&rfr_iscdi=true |