An Automatic Lipreading System for Spoken Digits With Limited Training Data

It is well known that visual cues of lip movement carry important speech-relevant information. This paper presents an automatic lipreading system for small-vocabulary speech recognition tasks. Using the lip segmentation and modeling techniques we developed earlier, we obtain a visual feature vector composed of outer and inner mouth features from the lip image sequence. A spline representation is employed to transform the discrete-time features sampled at the video frames into the continuous domain. The spline coefficients within the same word class are constrained to have a similar expression and are estimated from the training data by the EM algorithm. For the multiple-speaker/speaker-independent recognition task, an adaptive multimodel approach is proposed to handle the variations caused by different talking styles. After building the word models from the spline coefficients, a maximum likelihood classification approach is taken for recognition. Lip image sequences of the English digits 0 to 9 have been collected for the recognition test. Two widely used classification methods, HMM and RDA, have been adopted for comparison, and the results demonstrate that the proposed algorithm delivers the best performance among these methods.
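To make the pipeline in the abstract concrete, the following is a minimal Python sketch of its fit-then-classify structure: each scalar lip-feature trajectory is mapped to a fixed-length B-spline coefficient vector, a Gaussian model per digit word is fitted over those vectors, and recognition picks the word with the highest likelihood. Everything here is an illustrative assumption rather than the authors' implementation: the names (spline_coeffs, GaussianWordModel, classify), the uniform knot placement, and the diagonal-Gaussian word model are hypothetical; the paper itself ties the spline coefficients within a word class and estimates them jointly with the EM algorithm, with an adaptive multimodel extension for speaker variation.

import numpy as np
from scipy.interpolate import splrep

def spline_coeffs(trajectory, num_coeffs=8):
    """Fit a cubic B-spline to one feature trajectory (one visual feature
    over the frames of an utterance) and return a fixed number of
    coefficients, so utterances of different lengths become comparable.
    Assumes the trajectory has clearly more frames than num_coeffs."""
    u = np.linspace(0.0, 1.0, len(trajectory))        # normalized time axis
    # Uniform interior knots; with degree k=3 and (num_coeffs - 4) interior
    # knots, the least-squares spline has exactly num_coeffs coefficients.
    knots = np.linspace(0.0, 1.0, num_coeffs - 2)[1:-1]
    tck = splrep(u, np.asarray(trajectory, dtype=float), t=knots, k=3)
    return tck[1][:num_coeffs]                        # drop trailing zero padding

class GaussianWordModel:
    """Per-word Gaussian over spline coefficient vectors (diagonal covariance)."""
    def fit(self, coeff_vectors):
        X = np.asarray(coeff_vectors)
        self.mean = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6               # floor to avoid zero variance
        return self

    def log_likelihood(self, x):
        return -0.5 * np.sum(np.log(2.0 * np.pi * self.var)
                             + (x - self.mean) ** 2 / self.var)

def classify(models, coeff_vector):
    """Maximum likelihood decision over the trained word models."""
    return max(models, key=lambda w: models[w].log_likelihood(coeff_vector))

# Hypothetical usage, one model per spoken digit:
# models = {word: GaussianWordModel().fit([spline_coeffs(tr) for tr in trajs])
#           for word, trajs in training_data.items()}
# predicted = classify(models, spline_coeffs(test_trajectory))

For multicomponent feature vectors, such as the combined outer and inner mouth measurements described in the paper, one would fit each component's trajectory separately and concatenate the resulting coefficient vectors before modeling.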

Bibliographic details

Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2008-12, Vol. 18 (12), pp. 1760-1765
Authors: Wang, S.L.; Liew, A.W.C.; Lau, W.H.; Leung, S.H.
Format: Article
Language: English
DOI: 10.1109/TCSVT.2008.2004924
ISSN: 1051-8215
EISSN: 1558-2205
Source: IEEE Xplore
Subjects:
Algorithms
Applied sciences
Discrete transforms
Exact sciences and technology
Hidden Markov models
Image recognition
Image segmentation
Image sequences
Information, signal and communications theory
Lipreading
Miscellaneous
Mouth
Pattern recognition
Signal processing
Speech processing
Speech recognition
Spline
Studies
Telecommunications and information theory
Training data
visual feature extraction
visual speech recognition
Vocabulary