Multimodal Approach of Speech Emotion Recognition Using Multi-Level Multi-Head Fusion Attention-Based Recurrent Neural Network

Speech emotion recognition is a challenging but important task in human-computer interaction (HCI). As technology and the understanding of emotion progress, it is necessary to design robust and reliable emotion recognition systems that are suitable for real-world applications, both to enhance the analytical abilities that support human decision making and to design human-machine interfaces (HMI) that enable efficient communication. This paper presents a multimodal approach to speech emotion recognition based on a multi-level multi-head fusion attention mechanism and recurrent neural networks (RNNs). The proposed architecture takes two input modalities: audio and text. For the audio features, mel-frequency cepstral coefficients (MFCC) are computed from the raw signals with the openSMILE toolkit; for the text, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model embeds the transcripts. These features are fed in parallel into self-attention-based RNNs to exploit the context at each timestep, and the resulting representations are fused with a multi-head attention technique to predict emotional states. Experimental results on three databases, the Interactive Emotional Dyadic Motion Capture database (IEMOCAP), the Multimodal EmotionLines Dataset (MELD), and the CMU Multimodal Opinion Sentiment and Emotion Intensity dataset (CMU-MOSEI), show that combining the two modalities outperforms either single-modality model. Quantitative and qualitative evaluations on all three datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
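
As a rough illustration of the pipeline described in the abstract, the sketch below is a minimal, hypothetical PyTorch model: each modality (MFCC audio features, BERT text embeddings) passes through a bidirectional recurrent encoder with self-attention, and the two streams are then fused with multi-head attention before classification. The class names, layer sizes, GRU cell choice, 39-dimensional MFCC assumption, mean pooling, and audio-attends-to-text fusion direction are all illustrative assumptions, not the authors' implementation.

# Hypothetical sketch (PyTorch) of the two-branch architecture summarized above.
# All names, dimensions, and design choices are illustrative assumptions,
# not the authors' released code.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    # Bidirectional GRU followed by self-attention over its hidden states,
    # so each timestep can attend to the whole utterance context.
    def __init__(self, input_dim, hidden_dim=128, num_heads=4):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.self_attn = nn.MultiheadAttention(2 * hidden_dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: (batch, time, input_dim)
        h, _ = self.rnn(x)                      # (batch, time, 2*hidden_dim)
        ctx, _ = self.self_attn(h, h, h)        # self-attention over RNN states
        return ctx

class FusionSER(nn.Module):
    # Fuses the audio (MFCC) and text (BERT-embedding) streams with multi-head
    # attention and classifies the pooled result into emotion categories.
    def __init__(self, mfcc_dim=39, bert_dim=768, hidden_dim=128, num_heads=4, num_classes=4):
        super().__init__()
        self.audio_enc = ModalityEncoder(mfcc_dim, hidden_dim, num_heads)
        self.text_enc = ModalityEncoder(bert_dim, hidden_dim, num_heads)
        self.fusion = nn.MultiheadAttention(2 * hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, mfcc, bert_emb):
        a = self.audio_enc(mfcc)                # (batch, T_audio, 2*hidden_dim)
        t = self.text_enc(bert_emb)             # (batch, T_text, 2*hidden_dim)
        fused, _ = self.fusion(a, t, t)         # audio queries attend over text
        pooled = fused.mean(dim=1)              # average over time
        return self.classifier(pooled)          # emotion logits

# Toy usage: a batch of 2 utterances, 300 MFCC frames and 50 BERT token vectors each.
model = FusionSER()
logits = model(torch.randn(2, 300, 39), torch.randn(2, 50, 768))
print(logits.shape)                             # torch.Size([2, 4])

In practice one would also mask padded audio frames and text tokens inside the attention layers; the sketch omits this for brevity.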

Bibliographic Details
Published in: IEEE Access, 2020, Vol. 8, p. 61672-61686
Main authors: Ho, Ngoc-Huynh; Yang, Hyung-Jeong; Kim, Soo-Hyung; Lee, Gueesang
Format: Article
Language: English
Online access: Full text
DOI: 10.1109/ACCESS.2020.2984368
ISSN: 2169-3536
Source: IEEE Open Access Journals; DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek (freely accessible e-journals)
Subjects:
Algorithms
audio features
Bit error rate
Coders
Datasets
Decision analysis
Decision making
Emotion recognition
Emotions
Feature extraction
Hidden Markov models
Human performance
Human-computer interface
Mel frequency cepstral coefficient
Motion capture
multi-level multi-head fusion attention
Neural networks
Recurrent neural networks
RNN
Speech emotion recognition
Speech recognition
textual features