Evaluation of text-to-gesture generation model using convolutional neural network

Bibliographic Details
Published in: Neural networks, 2022-07, Vol. 151, p. 365-375
Main authors: Asakawa, Eiichi; Kaneko, Naoshi; Hasegawa, Dai; Shirakawa, Shinichi
Format: Article
Language: English
Subjects: Convolutional neural network; Deep learning; Gesture generation; Spoken text; Transformer architecture
Online access: Full text
Abstract:
Conversational gestures play a crucial role in realizing natural interactions with virtual agents and robots. Data-driven approaches, such as deep learning and machine learning, are promising for constructing a gesture generation model that automatically provides the gesture motion for speech or spoken texts. This study experimentally analyzes a deep learning-based gesture generation model from spoken text using a convolutional neural network. The proposed model takes a sequence of spoken words as input and outputs a sequence of 2D joint coordinates representing the conversational gesture motion. We prepare a dataset consisting of gesture motions and spoken texts by adding text information to an existing dataset, and train the models using a specific speaker's data. The quality of the generated gestures is compared with that of an existing speech-to-gesture generation model through a user perceptual study. The subjective evaluation shows that the model performance is comparable or superior to that of the existing speech-to-gesture generation model. In addition, we investigate the importance of data cleansing and loss function selection in the text-to-gesture generation model. We further examine the model's transferability between speakers; the experimental results demonstrate successful model transfer. Finally, we show that the text-to-gesture generation model can produce good-quality gestures even when using a transformer architecture.

Highlights:
• The quality of text-to-gesture generation models is evaluated through human perceptual studies.
• The quality of text-to-gesture generation models is comparable to that of speech-to-gesture models.
• Data cleansing and loss function selection are important in text-to-gesture generation models.
• The possibility of model transfer between speakers is demonstrated.
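The record does not include the paper's implementation details, but the abstract describes a convolutional model that maps a sequence of spoken words to a sequence of 2D joint coordinates. The following minimal PyTorch sketch illustrates that shape of model under stated assumptions: the vocabulary size, embedding dimension, channel widths, kernel size, number of joints, and the L1 training loss are all illustrative choices, not the configuration reported in the paper.

```python
# Hedged sketch of a text-to-gesture CNN: word IDs in, per-frame 2D joint
# coordinates out. All hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn

class TextToGestureCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden=256,
                 num_joints=10, kernel_size=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Temporal convolutions over the word sequence; the padding keeps
        # the output sequence the same length as the input.
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )
        # One (x, y) pair per joint per time step.
        self.head = nn.Conv1d(hidden, num_joints * 2, kernel_size=1)
        self.num_joints = num_joints

    def forward(self, word_ids):                    # word_ids: (batch, seq_len)
        x = self.embed(word_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        h = self.conv(x)                            # (batch, hidden, seq_len)
        out = self.head(h).transpose(1, 2)          # (batch, seq_len, joints*2)
        return out.view(out.size(0), out.size(1), self.num_joints, 2)

# Usage: a dummy batch of 2 "sentences" of 20 word tokens each.
model = TextToGestureCNN(vocab_size=5000)
poses = model(torch.randint(0, 5000, (2, 20)))
print(poses.shape)                                  # torch.Size([2, 20, 10, 2])

# The abstract notes that loss function selection matters; a plain L1
# regression loss on the joint coordinates is one common default, used
# here purely as a placeholder.
loss = nn.L1Loss()(poses, torch.zeros_like(poses))
```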
DOI: 10.1016/j.neunet.2022.03.041
ISSN: 0893-6080
EISSN: 1879-2782
PMID: 35472730
Publisher: Elsevier Ltd
Source: Elsevier ScienceDirect Journals Complete