Capturing Time Dynamics From Speech Using Neural Networks for Surgical Mask Detection
Saved in:
Published in: | IEEE journal of biomedical and health informatics 2022-08, Vol.26 (8), p.4291-4302 |
---|---|
Main Authors: | Liu, Shuo; Mallol-Ragolta, Adria; Yan, Tianhao; Qian, Kun; Parada-Cabaleiro, Emilia; Hu, Bin; Schuller, Bjorn W. |
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
container_end_page | 4302 |
---|---|
container_issue | 8 |
container_start_page | 4291 |
container_title | IEEE journal of biomedical and health informatics |
container_volume | 26 |
creator | Liu, Shuo; Mallol-Ragolta, Adria; Yan, Tianhao; Qian, Kun; Parada-Cabaleiro, Emilia; Hu, Bin; Schuller, Bjorn W. |
description | The importance of detecting whether a person wears a face mask while speaking has tremendously increased since the outbreak of SARS-CoV-2 (COVID-19), as wearing a mask can help to reduce the spread of the virus and mitigate the public health crisis. Besides affecting human speech characteristics related to frequency, face masks cause temporal interferences in speech, altering the pace, rhythm, and pronunciation speed. In this regard, this paper presents two effective neural network models to detect surgical masks from audio. The proposed architectures are both based on Convolutional Neural Networks (CNNs), chosen as an optimal approach for the spatial processing of the audio signals. One architecture applies a Long Short-Term Memory (LSTM) network to model the time-dependencies. Through an additional attention mechanism, the LSTM-based architecture enables the extraction of more salient temporal information. The other architecture (named ConvTx) retrieves the relative position of a sequence through the positional encoder of a transformer module. To assess to what extent both architectures can complement each other when modelling temporal dynamics, we also explore the combination of LSTMs and Transformers in three hybrid models. Finally, we investigate whether data augmentation techniques, such as using transitions between audio frames, and gender-dependent frameworks might impact the performance of the proposed architectures. Our experimental results show that one of the hybrid models achieves the best performance, surpassing existing state-of-the-art results for the task at hand. |
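The abstract names two temporal mechanisms: the sinusoidal positional encoder of the transformer module in ConvTx, and attention pooling over LSTM hidden states. The record contains no implementation detail, so the sketch below is illustrative only; the function names, the dot-product scoring vector, and the NumPy formulation are assumptions, not the authors' code. The positional encoding itself follows the standard transformer definition, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)).

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Standard transformer positional encoding (assumes d_model is even).

    Each position in the audio-frame sequence gets a unique vector whose
    components are sines/cosines of geometrically spaced frequencies, so
    relative positions are recoverable by the downstream attention layers.
    """
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

def attention_pool(h, w):
    """Pool a sequence of hidden states h (T, d) into one context vector.

    A learned scoring vector w (d,) rates each frame; a numerically stable
    softmax turns the scores into weights, emphasising salient frames.
    """
    scores = h @ w                       # (T,) one relevance score per frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # attention weights, sum to 1
    return alpha @ h, alpha              # context vector (d,), weights (T,)
```

With a zero scoring vector the weights are uniform and the pooled context reduces to the frame average; training shapes `w` so that frames carrying mask-related cues receive more weight.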
doi_str_mv | 10.1109/JBHI.2022.3173128 |
format | Article |
eissn | 2168-2208 |
pmid | 35522639 |
coden | IJBHA9 |
publisher | Piscataway: IEEE |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2168-2194 |
ispartof | IEEE journal of biomedical and health informatics, 2022-08, Vol.26 (8), p.4291-4302 |
issn | 2168-2194 2168-2208 |
language | eng |
recordid | cdi_proquest_journals_2703422744 |
source | IEEE Electronic Library (IEL) |
subjects | Acoustics; Artificial neural networks; Audio data; audio processing; Audio signals; Coders; Computer architecture; convolutional recurrent neural network; convolutional transformer network; COVID-19; Face mask detection; Face recognition; Feature extraction; Information processing; Long short-term memory; Masks; multi-head attention; Neural networks; Protective equipment; Public health; Severe acute respiratory syndrome coronavirus 2; Signal processing; Spectrogram; Speech; Task analysis; Transformers; Viral diseases; Viruses |
title | Capturing Time Dynamics From Speech Using Neural Networks for Surgical Mask Detection |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T00%3A03%3A57IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Capturing%20Time%20Dynamics%20From%20Speech%20Using%20Neural%20Networks%20for%20Surgical%20Mask%20Detection&rft.jtitle=IEEE%20journal%20of%20biomedical%20and%20health%20informatics&rft.au=Liu,%20Shuo&rft.date=2022-08-01&rft.volume=26&rft.issue=8&rft.spage=4291&rft.epage=4302&rft.pages=4291-4302&rft.issn=2168-2194&rft.eissn=2168-2208&rft.coden=IJBHA9&rft_id=info:doi/10.1109/JBHI.2022.3173128&rft_dat=%3Cproquest_RIE%3E2703422744%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2703422744&rft_id=info:pmid/35522639&rft_ieee_id=9770372&rfr_iscdi=true |