Frequency-domain linear prediction for temporal features

Current speech recognition systems uniformly employ short-time spectral analysis, usually over windows of 10-30 ms, as the basis for their acoustic representations. Any detail below this timescale is lost, and even temporal structures above this level are usually only weakly represented in the form...

Bibliographic details
Main authors:	Athineos, M.; Ellis, D.P.W.
Format:	Conference proceeding
Language:	eng
Subjects:	Acoustic testing; Automatic speech recognition; Discrete cosine transforms; Error analysis; Frequency domain analysis; Linear predictive coding; Predictive models; Spectral analysis; Speech recognition; Telephony
Online access:	Order full text

container_end_page	266
container_issue
container_start_page	261
container_title
container_volume
creator	Athineos, M. ; Ellis, D.P.W.
description	Current speech recognition systems uniformly employ short-time spectral analysis, usually over windows of 10-30 ms, as the basis for their acoustic representations. Any detail below this timescale is lost, and even temporal structures above this level are usually only weakly represented in the form of deltas etc. We address this limitation by proposing a novel representation of the temporal envelope in different frequency bands by exploring the dual of conventional linear prediction (LPC) when applied in the transform domain. With this technique of frequency-domain linear prediction (FDLP), the 'poles' of the model describe temporal, rather than spectral, peaks. By using analysis windows on the order of hundreds of milliseconds, the procedure automatically decides how to distribute poles to model the temporal structure best within the window. While this approach offers many possibilities for novel speech features, we experiment with one particular form, an index describing the 'sharpness' of individual poles within a window, and show a relatively large word error rate improvement from 4.97% to 3.81% in a recognizer trained on general conversational telephone speech and tested on a small-vocabulary spontaneous numbers task. We analyze this improvement in terms of the confusion matrices and suggest how the newly-modeled fine temporal structure may be helping.
doi_str_mv	10.1109/ASRU.2003.1318451
format	Conference Proceeding
fullrecord	<record><control><sourceid>ieee_6IE</sourceid><recordid>TN_cdi_ieee_primary_1318451</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>1318451</ieee_id><sourcerecordid>1318451</sourcerecordid><originalsourceid>FETCH-LOGICAL-i218t-9000bb7227d63ea9fbba3e3dbcb7910ef3986603f784d10d757f616bb86975043</originalsourceid><addsrcrecordid>eNotj81KxDAUhQMiKGMfQNzkBVpvetP8LIfBUWFAUGc9JM0NRPpn2lnM21twDge-3fk4jD0KqIQA-7z9-jxWNQBWAoWRjbhhhdUG1qK2BuCOFfP8A2tkIxWoe2b2mX7PNLSXMoy9SwPv0kAu8ylTSO2SxoHHMfOF-mnMruOR3HLOND-w2-i6mYorN-y4f_nevZWHj9f33fZQplqYpbSry3td1zooJGej9w4Jg2-9tgIoojVKAUZtZBAQdKOjEsp7o6xuQOKGPf3vJiI6TTn1Ll9O13_4B0NhRYM</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>conference_proceeding</recordtype></control><display><type>conference_proceeding</type><title>Frequency-domain linear prediction for temporal features</title><source>IEEE Electronic Library (IEL) Conference Proceedings</source><creator>Athineos, M. ; Ellis, D.P.W.</creator><creatorcontrib>Athineos, M. ; Ellis, D.P.W.</creatorcontrib><description>Current speech recognition systems uniformly employ short-time spectral analysis, usually over windows of 10-30 ms, as the basis for their acoustic representations. Any detail below this timescale is lost, and even temporal structures above this level are usually only weakly represented in the form of deltas etc. We address this limitation by proposing a novel representation of the temporal envelope in different frequency bands by exploring the dual of conventional linear prediction (LPC) when applied in the transform domain. With this technique of frequency-domain linear prediction (FDLP), the 'poles' of the model describe temporal, rather than spectral, peaks. By using analysis windows on the order of hundreds of milliseconds, the procedure automatically decides how to distribute poles to model the temporal structure best within the window. 
While this approach offers many possibilities for novel speech features, we experiment with one particular form, an index describing the 'sharpness' of individual poles within a window, and show a relatively large word error rate improvement from 4.97% to 3.81% in a recognizer trained on general conversational telephone speech and tested on a small-vocabulary spontaneous numbers task. We analyze this improvement in terms of the confusion matrices and suggest how the newly-modeled fine temporal structure may be helping.</description><identifier>ISBN: 9780780379800</identifier><identifier>ISBN: 0780379802</identifier><identifier>DOI: 10.1109/ASRU.2003.1318451</identifier><language>eng</language><publisher>IEEE</publisher><subject>Acoustic testing ; Automatic speech recognition ; Discrete cosine transforms ; Error analysis ; Frequency domain analysis ; Linear predictive coding ; Predictive models ; Spectral analysis ; Speech recognition ; Telephony</subject><ispartof>2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721), 2003, p.261-266</ispartof><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/1318451$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>309,310,776,780,785,786,2052,4036,4037,27904,54898</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/1318451$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Athineos, M.</creatorcontrib><creatorcontrib>Ellis, D.P.W.</creatorcontrib><title>Frequency-domain linear prediction for temporal features</title><title>2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. 
No.03EX721)</title><addtitle>ASRU</addtitle><description>Current speech recognition systems uniformly employ short-time spectral analysis, usually over windows of 10-30 ms, as the basis for their acoustic representations. Any detail below this timescale is lost, and even temporal structures above this level are usually only weakly represented in the form of deltas etc. We address this limitation by proposing a novel representation of the temporal envelope in different frequency bands by exploring the dual of conventional linear prediction (LPC) when applied in the transform domain. With this technique of frequency-domain linear prediction (FDLP), the 'poles' of the model describe temporal, rather than spectral, peaks. By using analysis windows on the order of hundreds of milliseconds, the procedure automatically decides how to distribute poles to model the temporal structure best within the window. While this approach offers many possibilities for novel speech features, we experiment with one particular form, an index describing the 'sharpness' of individual poles within a window, and show a relatively large word error rate improvement from 4.97% to 3.81% in a recognizer trained on general conversational telephone speech and tested on a small-vocabulary spontaneous numbers task. 
We analyze this improvement in terms of the confusion matrices and suggest how the newly-modeled fine temporal structure may be helping.</description><subject>Acoustic testing</subject><subject>Automatic speech recognition</subject><subject>Discrete cosine transforms</subject><subject>Error analysis</subject><subject>Frequency domain analysis</subject><subject>Linear predictive coding</subject><subject>Predictive models</subject><subject>Spectral analysis</subject><subject>Speech recognition</subject><subject>Telephony</subject><isbn>9780780379800</isbn><isbn>0780379802</isbn><fulltext>true</fulltext><rsrctype>conference_proceeding</rsrctype><creationdate>2003</creationdate><recordtype>conference_proceeding</recordtype><sourceid>6IE</sourceid><sourceid>RIE</sourceid><recordid>eNotj81KxDAUhQMiKGMfQNzkBVpvetP8LIfBUWFAUGc9JM0NRPpn2lnM21twDge-3fk4jD0KqIQA-7z9-jxWNQBWAoWRjbhhhdUG1qK2BuCOFfP8A2tkIxWoe2b2mX7PNLSXMoy9SwPv0kAu8ylTSO2SxoHHMfOF-mnMruOR3HLOND-w2-i6mYorN-y4f_nevZWHj9f33fZQplqYpbSry3td1zooJGej9w4Jg2-9tgIoojVKAUZtZBAQdKOjEsp7o6xuQOKGPf3vJiI6TTn1Ll9O13_4B0NhRYM</recordid><startdate>2003</startdate><enddate>2003</enddate><creator>Athineos, M.</creator><creator>Ellis, D.P.W.</creator><general>IEEE</general><scope>6IE</scope><scope>6IL</scope><scope>CBEJK</scope><scope>RIE</scope><scope>RIL</scope></search><sort><creationdate>2003</creationdate><title>Frequency-domain linear prediction for temporal features</title><author>Athineos, M. 
; Ellis, D.P.W.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-i218t-9000bb7227d63ea9fbba3e3dbcb7910ef3986603f784d10d757f616bb86975043</frbrgroupid><rsrctype>conference_proceedings</rsrctype><prefilter>conference_proceedings</prefilter><language>eng</language><creationdate>2003</creationdate><topic>Acoustic testing</topic><topic>Automatic speech recognition</topic><topic>Discrete cosine transforms</topic><topic>Error analysis</topic><topic>Frequency domain analysis</topic><topic>Linear predictive coding</topic><topic>Predictive models</topic><topic>Spectral analysis</topic><topic>Speech recognition</topic><topic>Telephony</topic><toplevel>online_resources</toplevel><creatorcontrib>Athineos, M.</creatorcontrib><creatorcontrib>Ellis, D.P.W.</creatorcontrib><collection>IEEE Electronic Library (IEL) Conference Proceedings</collection><collection>IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume</collection><collection>IEEE Xplore All Conference Proceedings</collection><collection>IEEE Electronic Library (IEL)</collection><collection>IEEE Proceedings Order Plans (POP All) 1998-Present</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Athineos, M.</au><au>Ellis, D.P.W.</au><format>book</format><genre>proceeding</genre><ristype>CONF</ristype><atitle>Frequency-domain linear prediction for temporal features</atitle><btitle>2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)</btitle><stitle>ASRU</stitle><date>2003</date><risdate>2003</risdate><spage>261</spage><epage>266</epage><pages>261-266</pages><isbn>9780780379800</isbn><isbn>0780379802</isbn><abstract>Current speech recognition systems uniformly employ short-time spectral analysis, usually over windows of 10-30 ms, as the basis for their acoustic representations. 
Any detail below this timescale is lost, and even temporal structures above this level are usually only weakly represented in the form of deltas etc. We address this limitation by proposing a novel representation of the temporal envelope in different frequency bands by exploring the dual of conventional linear prediction (LPC) when applied in the transform domain. With this technique of frequency-domain linear prediction (FDLP), the 'poles' of the model describe temporal, rather than spectral, peaks. By using analysis windows on the order of hundreds of milliseconds, the procedure automatically decides how to distribute poles to model the temporal structure best within the window. While this approach offers many possibilities for novel speech features, we experiment with one particular form, an index describing the 'sharpness' of individual poles within a window, and show a relatively large word error rate improvement from 4.97% to 3.81% in a recognizer trained on general conversational telephone speech and tested on a small-vocabulary spontaneous numbers task. We analyze this improvement in terms of the confusion matrices and suggest how the newly-modeled fine temporal structure may be helping.</abstract><pub>IEEE</pub><doi>10.1109/ASRU.2003.1318451</doi><tpages>6</tpages><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISBN: 9780780379800
ispartof	2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721), 2003, p.261-266
issn
language	eng
recordid	cdi_ieee_primary_1318451
source	IEEE Electronic Library (IEL) Conference Proceedings
subjects	Acoustic testing ; Automatic speech recognition ; Discrete cosine transforms ; Error analysis ; Frequency domain analysis ; Linear predictive coding ; Predictive models ; Spectral analysis ; Speech recognition ; Telephony
title	Frequency-domain linear prediction for temporal features
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T18%3A07%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_6IE&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=Frequency-domain%20linear%20prediction%20for%20temporal%20features&rft.btitle=2003%20IEEE%20Workshop%20on%20Automatic%20Speech%20Recognition%20and%20Understanding%20(IEEE%20Cat.%20No.03EX721)&rft.au=Athineos,%20M.&rft.date=2003&rft.spage=261&rft.epage=266&rft.pages=261-266&rft.isbn=9780780379800&rft.isbn_list=0780379802&rft_id=info:doi/10.1109/ASRU.2003.1318451&rft_dat=%3Cieee_6IE%3E1318451%3C/ieee_6IE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=1318451&rfr_iscdi=true
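
The abstract describes frequency-domain linear prediction (FDLP) as linear prediction applied in the transform domain, so that the model poles follow temporal rather than spectral peaks. The sketch below illustrates that idea only; it is not the authors' implementation. The choice of an unnormalised DCT-II, the autocorrelation method with Levinson-Durbin recursion, the mapping of time index t to angle pi*(t + 0.5)/N, and all function names (`dct_ii`, `fdlp_envelope`, `demo_signal`, `top_peaks`) are assumptions made for illustration. The 'sharpness' index used as a feature in the paper is not reproduced here.

```python
import math


def dct_ii(x):
    """Unnormalised DCT-II of a real sequence (O(N^2), fine for a sketch)."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]


def autocorrelation(seq, max_lag):
    """Biased autocorrelation r[0..max_lag] of a real sequence."""
    n = len(seq)
    return [sum(seq[i] * seq[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns (a, err) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error
    return a, err


def fdlp_envelope(x, order):
    """All-pole estimate of the squared temporal envelope of x."""
    n = len(x)
    coeffs = dct_ii(x)                      # time domain -> transform domain
    r = autocorrelation(coeffs, order)
    a, err = levinson_durbin(r, order)      # LPC across the DCT coefficients
    env = []
    for t in range(n):
        w = math.pi * (t + 0.5) / n         # assumed mapping: time -> [0, pi]
        re = sum(a[j] * math.cos(w * j) for j in range(order + 1))
        im = -sum(a[j] * math.sin(w * j) for j in range(order + 1))
        env.append(err / (re * re + im * im))
    return env


def demo_signal(n=256, clicks=(60, 180)):
    """Hypothetical test input: silence with unit clicks at the given samples."""
    x = [0.0] * n
    for c in clicks:
        x[c] = 1.0
    return x


def top_peaks(env, count):
    """Indices of the `count` largest local maxima, returned in time order."""
    peaks = [t for t in range(1, len(env) - 1)
             if env[t] > env[t - 1] and env[t] >= env[t + 1]]
    peaks.sort(key=lambda t: env[t], reverse=True)
    return sorted(peaks[:count])


if __name__ == "__main__":
    envelope = fdlp_envelope(demo_signal(), order=4)
    print(top_peaks(envelope, 2))
```

With two clicks and a model order of 4, the two conjugate pole pairs should settle near the click positions, which is the sense in which the poles "describe temporal peaks". A real front end would presumably use a fast DCT, sub-band windows over the DCT coefficients, and longer analysis windows, none of which are shown here.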