Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing

Speech discrimination that determines whether a participant is speaking at a given moment is essential in investigating human verbal communication. Specifically, in dynamic real-world situations where multiple people participate in, and form, groups in the same space, simultaneous speakers render speech discrimination that is solely based on audio sensing difficult. In this study, we focused on physical activity during speech, and hypothesized that combining audio and physical motion data acquired by wearable sensors can improve speech discrimination. Thus, utterance and physical activity data of students in a university participatory class were recorded, using smartphones worn around their neck. First, we tested the temporal relationship between manually identified utterances and physical motions and confirmed that physical activities in wide-frequency ranges co-occurred with utterances. Second, we trained and tested classifiers for each participant and found a higher performance with the audio-motion classifier (average accuracy 92.2%) than both the audio-only (80.4%) and motion-only (87.8%) classifiers. Finally, we tested inter-individual classification and obtained a higher performance with the audio-motion combined classifier (83.2%) than the audio-only (67.7%) and motion-only (71.9%) classifiers. These results show that audio-motion multimodal sensing using widely available smartphones can provide effective utterance discrimination in dynamic group communications.
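The abstract describes window-level speech discrimination from fused audio and motion features recorded by a neck-worn smartphone. The sketch below illustrates that general fusion idea only; the record does not give the paper's actual windowing, feature set, or classifier, so the one-second windows, the log-energy/zero-crossing and acceleration-magnitude features, the RandomForestClassifier, and the synthetic stand-in data used here are all assumptions for illustration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assumed sampling rates and 1-second analysis windows (not specified in this record).
FS_AUDIO = 8000   # audio samples per second
FS_MOTION = 50    # 3-axis accelerometer samples per second

def audio_features(win):
    # Log energy and zero-crossing rate of one audio window.
    log_energy = np.log(np.mean(win ** 2) + 1e-12)
    zcr = np.mean(np.abs(np.diff(np.sign(win)))) / 2.0
    return np.array([log_energy, zcr])

def motion_features(win_xyz):
    # Mean and standard deviation of the acceleration magnitude in one window.
    mag = np.linalg.norm(win_xyz, axis=1)
    return np.array([mag.mean(), mag.std()])

def fused_features(audio, accel_xyz, n_windows):
    # Concatenate audio and motion features for each synchronized 1-second window.
    feats = []
    for i in range(n_windows):
        a = audio[i * FS_AUDIO:(i + 1) * FS_AUDIO]
        m = accel_xyz[i * FS_MOTION:(i + 1) * FS_MOTION]
        feats.append(np.concatenate([audio_features(a), motion_features(m)]))
    return np.vstack(feats)

# Synthetic stand-in signals; real use would load the smartphone recordings instead.
rng = np.random.default_rng(0)
n_windows = 200
labels = rng.integers(0, 2, n_windows)            # 1 = speaking, 0 = not speaking
audio = rng.normal(0.0, 1.0, n_windows * FS_AUDIO)
accel = rng.normal(0.0, 0.1, (n_windows * FS_MOTION, 3))
for i, speaking in enumerate(labels):
    if speaking:                                  # speaking windows: louder audio, more motion
        audio[i * FS_AUDIO:(i + 1) * FS_AUDIO] *= 3.0
        accel[i * FS_MOTION:(i + 1) * FS_MOTION] += rng.normal(0.0, 0.3, (FS_MOTION, 3))

X = fused_features(audio, accel, n_windows)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("speech/non-speech accuracy:", accuracy_score(y_te, clf.predict(X_te)))

The per-participant versus inter-individual evaluations reported in the abstract would correspond, in this framing, to splitting windows within one speaker's recording versus training on some speakers and testing on others.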

Bibliographic Details
Published in: Sensors (Basel, Switzerland), 2020-05, Vol. 20 (10), p. 2948, Article 2948
Main authors: Nozawa, Takayuki; Uchiyama, Mizuki; Honda, Keigo; Nakano, Tamio; Miyake, Yoshihiro
Format: Article
Language: English
Subjects:
Online access: Full text
container_end_page
container_issue 10
container_start_page 2948
container_title Sensors (Basel, Switzerland)
container_volume 20
creator Nozawa, Takayuki
Uchiyama, Mizuki
Honda, Keigo
Nakano, Tamio
Miyake, Yoshihiro
description Speech discrimination that determines whether a participant is speaking at a given moment is essential in investigating human verbal communication. Specifically, in dynamic real-world situations where multiple people participate in, and form, groups in the same space, simultaneous speakers render speech discrimination that is solely based on audio sensing difficult. In this study, we focused on physical activity during speech, and hypothesized that combining audio and physical motion data acquired by wearable sensors can improve speech discrimination. Thus, utterance and physical activity data of students in a university participatory class were recorded, using smartphones worn around their neck. First, we tested the temporal relationship between manually identified utterances and physical motions and confirmed that physical activities in wide-frequency ranges co-occurred with utterances. Second, we trained and tested classifiers for each participant and found a higher performance with the audio-motion classifier (average accuracy 92.2%) than both the audio-only (80.4%) and motion-only (87.8%) classifiers. Finally, we tested inter-individual classification and obtained a higher performance with the audio-motion combined classifier (83.2%) than the audio-only (67.7%) and motion-only (71.9%) classifiers. These results show that audio-motion multimodal sensing using widely available smartphones can provide effective utterance discrimination in dynamic group communications.
doi_str_mv 10.3390/s20102948
format Article
fullrecord <record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmed_primary_32456031</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><doaj_id>oai_doaj_org_article_c1e76932cc854397957b4b5a36a9dd88</doaj_id><sourcerecordid>2407313608</sourcerecordid><originalsourceid>FETCH-LOGICAL-c579t-5d2c84697701563cb75231f0e76e92eb97a829fecbe970e626e7422e01a43ab33</originalsourceid><addsrcrecordid>eNqNkl1rFDEUhoMotq5e-AdkwBtFRvMxmSQ3hTJqLbQI1uKVhEzmzDbLTLImM4r_3uxOXVqvvErIeXhycvIi9Jzgt4wp_C5RTDBVlXyAjklFq1JSih_e2R-hJyltMKaMMfkYHTFa8Rozcoy-X20B7E3x3iUb3ei8mVzwhfPFFzBD-S3EoSvOYpi3RRPGcfbOLsR1cn5dnM6dC-Vl2B9dzsPkxtCZobgCv6s_RY96MyR4druu0PXHD1-bT-XF57Pz5vSitFyoqeQdtbKqlRCY8JrZVnDKSI9B1KAotEoYSVUPtgUlMNS0BlFRCpiYipmWsRU6X7xdMBu9zQ8x8bcOxun9QYhrbeLk7ADakmxVjForecWUUFy0VcsNq43qOimz62Rxbed2hM6Cn6IZ7knvV7y70evwUwsqheA8C17dCmL4MUOa9JiHC8NgPIQ5aVphwQir8e6ul_-gmzBHn0e1o2oheXZm6vVC2RhSitAfmiFY7wKgDwHI7Iu73R_Ivz-egTcL8Ava0CfrwFs4YBhjzvJ0mMhhyXFZIfn_dOOmfTaaMPuJ_QGwLMrG</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2406785287</pqid></control><display><type>article</type><title>Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing</title><source>DOAJ Directory of Open Access Journals</source><source>MDPI - Multidisciplinary Digital Publishing Institute</source><source>Web of Science - Science Citation Index Expanded - 2020&lt;img src="https://exlibris-pub.s3.amazonaws.com/fromwos-v2.jpg" /&gt;</source><source>EZB-FREE-00999 freely available EZB journals</source><source>PubMed Central</source><source>Web of Science - Social Sciences Citation Index – 2020&lt;img src="https://exlibris-pub.s3.amazonaws.com/fromwos-v2.jpg" /&gt;</source><source>Free Full-Text Journals in Chemistry</source><creator>Nozawa, Takayuki ; Uchiyama, Mizuki ; Honda, Keigo ; Nakano, Tamio ; Miyake, Yoshihiro</creator><creatorcontrib>Nozawa, Takayuki ; Uchiyama, Mizuki ; Honda, Keigo ; Nakano, Tamio ; Miyake, Yoshihiro</creatorcontrib><description>Speech discrimination that determines whether a participant is speaking at a given moment is essential in investigating human verbal communication. Specifically, in dynamic real-world situations where multiple people participate in, and form, groups in the same space, simultaneous speakers render speech discrimination that is solely based on audio sensing difficult. In this study, we focused on physical activity during speech, and hypothesized that combining audio and physical motion data acquired by wearable sensors can improve speech discrimination. Thus, utterance and physical activity data of students in a university participatory class were recorded, using smartphones worn around their neck. First, we tested the temporal relationship between manually identified utterances and physical motions and confirmed that physical activities in wide-frequency ranges co-occurred with utterances. Second, we trained and tested classifiers for each participant and found a higher performance with the audio-motion classifier (average accuracy 92.2%) than both the audio-only (80.4%) and motion-only (87.8%) classifiers. Finally, we tested inter-individual classification and obtained a higher performance with the audio-motion combined classifier (83.2%) than the audio-only (67.7%) and motion-only (71.9%) classifiers. 
These results show that audio-motion multimodal sensing using widely available smartphones can provide effective utterance discrimination in dynamic group communications.</description><identifier>ISSN: 1424-8220</identifier><identifier>EISSN: 1424-8220</identifier><identifier>DOI: 10.3390/s20102948</identifier><identifier>PMID: 32456031</identifier><language>eng</language><publisher>BASEL: Mdpi</publisher><subject>Accuracy ; Audio data ; Chemistry ; Chemistry, Analytical ; Conferences ; Data acquisition ; Discrimination ; Engineering ; Engineering, Electrical &amp; Electronic ; Frequency ranges ; Group communication ; Hypotheses ; Instruments &amp; Instrumentation ; multimodal sensing ; physical motion ; Physical Sciences ; Science &amp; Technology ; sensor fusion ; Sensors ; smartphone ; Smartphones ; Speech ; speech discrimination ; Students ; Technology ; Verbal communication ; Voice recognition</subject><ispartof>Sensors (Basel, Switzerland), 2020-05, Vol.20 (10), p.2948, Article 2948</ispartof><rights>2020. This work is licensed under http://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>2020 by the authors. 2020</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>true</woscitedreferencessubscribed><woscitedreferencescount>1</woscitedreferencescount><woscitedreferencesoriginalsourcerecordid>wos000539323700202</woscitedreferencesoriginalsourcerecordid><citedby>FETCH-LOGICAL-c579t-5d2c84697701563cb75231f0e76e92eb97a829fecbe970e626e7422e01a43ab33</citedby><cites>FETCH-LOGICAL-c579t-5d2c84697701563cb75231f0e76e92eb97a829fecbe970e626e7422e01a43ab33</cites><orcidid>0000-0001-6300-4373</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7287755/pdf/$$EPDF$$P50$$Gpubmedcentral$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7287755/$$EHTML$$P50$$Gpubmedcentral$$Hfree_for_read</linktohtml><link.rule.ids>230,315,728,781,785,865,886,2103,2115,27929,27930,28253,28254,53796,53798</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/32456031$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Nozawa, Takayuki</creatorcontrib><creatorcontrib>Uchiyama, Mizuki</creatorcontrib><creatorcontrib>Honda, Keigo</creatorcontrib><creatorcontrib>Nakano, Tamio</creatorcontrib><creatorcontrib>Miyake, Yoshihiro</creatorcontrib><title>Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing</title><title>Sensors (Basel, Switzerland)</title><addtitle>SENSORS-BASEL</addtitle><addtitle>Sensors (Basel)</addtitle><description>Speech discrimination that determines whether a participant is speaking at a given moment is essential in investigating human verbal communication. Specifically, in dynamic real-world situations where multiple people participate in, and form, groups in the same space, simultaneous speakers render speech discrimination that is solely based on audio sensing difficult. In this study, we focused on physical activity during speech, and hypothesized that combining audio and physical motion data acquired by wearable sensors can improve speech discrimination. 
Thus, utterance and physical activity data of students in a university participatory class were recorded, using smartphones worn around their neck. First, we tested the temporal relationship between manually identified utterances and physical motions and confirmed that physical activities in wide-frequency ranges co-occurred with utterances. Second, we trained and tested classifiers for each participant and found a higher performance with the audio-motion classifier (average accuracy 92.2%) than both the audio-only (80.4%) and motion-only (87.8%) classifiers. Finally, we tested inter-individual classification and obtained a higher performance with the audio-motion combined classifier (83.2%) than the audio-only (67.7%) and motion-only (71.9%) classifiers. These results show that audio-motion multimodal sensing using widely available smartphones can provide effective utterance discrimination in dynamic group communications.</description><subject>Accuracy</subject><subject>Audio data</subject><subject>Chemistry</subject><subject>Chemistry, Analytical</subject><subject>Conferences</subject><subject>Data acquisition</subject><subject>Discrimination</subject><subject>Engineering</subject><subject>Engineering, Electrical &amp; Electronic</subject><subject>Frequency ranges</subject><subject>Group communication</subject><subject>Hypotheses</subject><subject>Instruments &amp; Instrumentation</subject><subject>multimodal sensing</subject><subject>physical motion</subject><subject>Physical Sciences</subject><subject>Science &amp; Technology</subject><subject>sensor fusion</subject><subject>Sensors</subject><subject>smartphone</subject><subject>Smartphones</subject><subject>Speech</subject><subject>speech discrimination</subject><subject>Students</subject><subject>Technology</subject><subject>Verbal communication</subject><subject>Voice recognition</subject><issn>1424-8220</issn><issn>1424-8220</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>AOWDO</sourceid><sourceid>ARHDP</sourceid><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>DOA</sourceid><recordid>eNqNkl1rFDEUhoMotq5e-AdkwBtFRvMxmSQ3hTJqLbQI1uKVhEzmzDbLTLImM4r_3uxOXVqvvErIeXhycvIi9Jzgt4wp_C5RTDBVlXyAjklFq1JSih_e2R-hJyltMKaMMfkYHTFa8Rozcoy-X20B7E3x3iUb3ei8mVzwhfPFFzBD-S3EoSvOYpi3RRPGcfbOLsR1cn5dnM6dC-Vl2B9dzsPkxtCZobgCv6s_RY96MyR4druu0PXHD1-bT-XF57Pz5vSitFyoqeQdtbKqlRCY8JrZVnDKSI9B1KAotEoYSVUPtgUlMNS0BlFRCpiYipmWsRU6X7xdMBu9zQ8x8bcOxun9QYhrbeLk7ADakmxVjForecWUUFy0VcsNq43qOimz62Rxbed2hM6Cn6IZ7knvV7y70evwUwsqheA8C17dCmL4MUOa9JiHC8NgPIQ5aVphwQir8e6ul_-gmzBHn0e1o2oheXZm6vVC2RhSitAfmiFY7wKgDwHI7Iu73R_Ivz-egTcL8Ava0CfrwFs4YBhjzvJ0mMhhyXFZIfn_dOOmfTaaMPuJ_QGwLMrG</recordid><startdate>20200522</startdate><enddate>20200522</enddate><creator>Nozawa, Takayuki</creator><creator>Uchiyama, Mizuki</creator><creator>Honda, Keigo</creator><creator>Nakano, Tamio</creator><creator>Miyake, Yoshihiro</creator><general>Mdpi</general><general>MDPI 
AG</general><general>MDPI</general><scope>17B</scope><scope>AOWDO</scope><scope>ARHDP</scope><scope>BLEPL</scope><scope>DTL</scope><scope>DVR</scope><scope>EGQ</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>K9.</scope><scope>M0S</scope><scope>M1P</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0001-6300-4373</orcidid></search><sort><creationdate>20200522</creationdate><title>Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing</title><author>Nozawa, Takayuki ; Uchiyama, Mizuki ; Honda, Keigo ; Nakano, Tamio ; Miyake, Yoshihiro</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c579t-5d2c84697701563cb75231f0e76e92eb97a829fecbe970e626e7422e01a43ab33</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Accuracy</topic><topic>Audio data</topic><topic>Chemistry</topic><topic>Chemistry, Analytical</topic><topic>Conferences</topic><topic>Data acquisition</topic><topic>Discrimination</topic><topic>Engineering</topic><topic>Engineering, Electrical &amp; Electronic</topic><topic>Frequency ranges</topic><topic>Group communication</topic><topic>Hypotheses</topic><topic>Instruments &amp; Instrumentation</topic><topic>multimodal sensing</topic><topic>physical motion</topic><topic>Physical Sciences</topic><topic>Science &amp; Technology</topic><topic>sensor fusion</topic><topic>Sensors</topic><topic>smartphone</topic><topic>Smartphones</topic><topic>Speech</topic><topic>speech discrimination</topic><topic>Students</topic><topic>Technology</topic><topic>Verbal communication</topic><topic>Voice recognition</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Nozawa, Takayuki</creatorcontrib><creatorcontrib>Uchiyama, Mizuki</creatorcontrib><creatorcontrib>Honda, Keigo</creatorcontrib><creatorcontrib>Nakano, Tamio</creatorcontrib><creatorcontrib>Miyake, Yoshihiro</creatorcontrib><collection>Web of Knowledge</collection><collection>Web of Science - Science Citation Index Expanded - 2020</collection><collection>Web of Science - Social Sciences Citation Index – 2020</collection><collection>Web of Science Core Collection</collection><collection>Science Citation Index Expanded</collection><collection>Social Sciences Citation Index</collection><collection>Web of Science Primary (SCIE, SSCI &amp; AHCI)</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest 
Central</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>Medical Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>Sensors (Basel, Switzerland)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Nozawa, Takayuki</au><au>Uchiyama, Mizuki</au><au>Honda, Keigo</au><au>Nakano, Tamio</au><au>Miyake, Yoshihiro</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing</atitle><jtitle>Sensors (Basel, Switzerland)</jtitle><stitle>SENSORS-BASEL</stitle><addtitle>Sensors (Basel)</addtitle><date>2020-05-22</date><risdate>2020</risdate><volume>20</volume><issue>10</issue><spage>2948</spage><pages>2948-</pages><artnum>2948</artnum><issn>1424-8220</issn><eissn>1424-8220</eissn><abstract>Speech discrimination that determines whether a participant is speaking at a given moment is essential in investigating human verbal communication. Specifically, in dynamic real-world situations where multiple people participate in, and form, groups in the same space, simultaneous speakers render speech discrimination that is solely based on audio sensing difficult. In this study, we focused on physical activity during speech, and hypothesized that combining audio and physical motion data acquired by wearable sensors can improve speech discrimination. Thus, utterance and physical activity data of students in a university participatory class were recorded, using smartphones worn around their neck. First, we tested the temporal relationship between manually identified utterances and physical motions and confirmed that physical activities in wide-frequency ranges co-occurred with utterances. Second, we trained and tested classifiers for each participant and found a higher performance with the audio-motion classifier (average accuracy 92.2%) than both the audio-only (80.4%) and motion-only (87.8%) classifiers. Finally, we tested inter-individual classification and obtained a higher performance with the audio-motion combined classifier (83.2%) than the audio-only (67.7%) and motion-only (71.9%) classifiers. These results show that audio-motion multimodal sensing using widely available smartphones can provide effective utterance discrimination in dynamic group communications.</abstract><cop>BASEL</cop><pub>Mdpi</pub><pmid>32456031</pmid><doi>10.3390/s20102948</doi><tpages>13</tpages><orcidid>https://orcid.org/0000-0001-6300-4373</orcidid><oa>free_for_read</oa></addata></record>
identifier ISSN: 1424-8220
ispartof Sensors (Basel, Switzerland), 2020-05, Vol.20 (10), p.2948, Article 2948
issn 1424-8220
1424-8220
language eng
recordid cdi_pubmed_primary_32456031
source DOAJ Directory of Open Access Journals; MDPI - Multidisciplinary Digital Publishing Institute; Web of Science - Science Citation Index Expanded - 2020; EZB-FREE-00999 freely available EZB journals; PubMed Central; Web of Science - Social Sciences Citation Index - 2020; Free Full-Text Journals in Chemistry
subjects Accuracy
Audio data
Chemistry
Chemistry, Analytical
Conferences
Data acquisition
Discrimination
Engineering
Engineering, Electrical & Electronic
Frequency ranges
Group communication
Hypotheses
Instruments & Instrumentation
multimodal sensing
physical motion
Physical Sciences
Science & Technology
sensor fusion
Sensors
smartphone
Smartphones
Speech
speech discrimination
Students
Technology
Verbal communication
Voice recognition
title Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-14T16%3A54%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Speech%20Discrimination%20in%20Real-World%20Group%20Communication%20Using%20Audio-Motion%20Multimodal%20Sensing&rft.jtitle=Sensors%20(Basel,%20Switzerland)&rft.au=Nozawa,%20Takayuki&rft.date=2020-05-22&rft.volume=20&rft.issue=10&rft.spage=2948&rft.pages=2948-&rft.artnum=2948&rft.issn=1424-8220&rft.eissn=1424-8220&rft_id=info:doi/10.3390/s20102948&rft_dat=%3Cproquest_pubme%3E2407313608%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2406785287&rft_id=info:pmid/32456031&rft_doaj_id=oai_doaj_org_article_c1e76932cc854397957b4b5a36a9dd88&rfr_iscdi=true