A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training

This work develops a continuous sign language (SL) recognition framework with deep neural networks, which directly transcribes videos of SL sentences to sequences of ordered gloss labels. Previous methods dealing with continuous SL recognition usually employ hidden Markov models with limited capacity to capture the temporal information. In contrast, our proposed architecture adopts deep convolutional neural networks with stacked temporal fusion layers as the feature extraction module, and bidirectional recurrent neural networks as the sequence learning module. We propose an iterative optimization process for our architecture to fully exploit the representation capability of deep neural networks with limited data. We first train the end-to-end recognition model for alignment proposal, and then use the alignment proposal as strong supervisory information to directly tune the feature extraction module. This training process can run iteratively to achieve improvements on the recognition performance. We further contribute by exploring the multimodal fusion of RGB images and optical flow in sign language. Our method is evaluated on two challenging SL recognition benchmarks, and outperforms the state of the art by a relative improvement of more than 15% on both databases.

Bibliographic details

Published in: IEEE transactions on multimedia, 2019-07, Vol. 21 (7), p. 1880-1891
Main authors: Cui, Runpeng; Liu, Hu; Zhang, Changshui
Format: Article
Language: English
Description: This work develops a continuous sign language (SL) recognition framework with deep neural networks, which directly transcribes videos of SL sentences to sequences of ordered gloss labels. Previous methods dealing with continuous SL recognition usually employ hidden Markov models with limited capacity to capture the temporal information. In contrast, our proposed architecture adopts deep convolutional neural networks with stacked temporal fusion layers as the feature extraction module, and bidirectional recurrent neural networks as the sequence learning module. We propose an iterative optimization process for our architecture to fully exploit the representation capability of deep neural networks with limited data. We first train the end-to-end recognition model for alignment proposal, and then use the alignment proposal as strong supervisory information to directly tune the feature extraction module. This training process can run iteratively to achieve improvements on the recognition performance. We further contribute by exploring the multimodal fusion of RGB images and optical flow in sign language. Our method is evaluated on two challenging SL recognition benchmarks, and outperforms the state of the art by a relative improvement of more than 15% on both databases.
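The iterative process the abstract outlines — train the end-to-end model, extract an alignment proposal, then use that proposal as frame-level supervision for the feature extractor — can be illustrated with a toy alignment step. A hedged sketch (uniform segmentation stands in for the model's learned alignment, and the function names are hypothetical, not from the paper's code):

```python
# Toy illustration of the alignment-proposal step: a gloss sequence is
# spread over the video frames to give per-frame labels, which could then
# directly supervise the feature-extraction module. Uniform segmentation
# stands in for a learned alignment; all names are hypothetical.

def align_uniform(num_frames, glosses):
    """Assign each frame a gloss label by uniform segmentation."""
    labels = []
    for t in range(num_frames):
        idx = t * len(glosses) // num_frames  # which gloss covers frame t
        labels.append(glosses[idx])
    return labels

def iterate_training(num_frames, glosses, rounds=2):
    """Sketch of the iterative loop: propose an alignment, (re)tune the
    feature extractor on the per-frame labels, and repeat."""
    alignment = align_uniform(num_frames, glosses)
    for _ in range(rounds):
        # 1) tune the feature extractor on (frame, label) pairs -- omitted here
        # 2) re-run the end-to-end model to refine the alignment -- stubbed
        #    with the same uniform proposal for this illustration
        alignment = align_uniform(num_frames, glosses)
    return alignment

print(align_uniform(6, ["HELLO", "WORLD"]))
# -> ['HELLO', 'HELLO', 'HELLO', 'WORLD', 'WORLD', 'WORLD']
```

In the paper's actual method the refined alignment comes from the trained end-to-end model, not from uniform segmentation; this sketch only shows the shape of the loop in which alignment proposals become frame-level supervision.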
DOI: 10.1109/TMM.2018.2889563
ISSN: 1520-9210
EISSN: 1941-0077
Source: IEEE Electronic Library (IEL)
Subjects:
Alignment
Architecture
Artificial neural networks
Color imagery
Continuous sign language recognition
Convolutional neural networks
Feature extraction
Gesture recognition
Gloss
Hidden Markov models
Iterative methods
iterative training
Markov chains
Modules
multimodal fusion
Neural networks
Optical flow (image analysis)
Optimization
Recognition
Recurrent neural networks
Sentences
sequence learning
Sign language
Training
Videos