Audio-Visual Continuous Recognition of Emotional State in a Multi-User System Based on Personalized Representation of Facial Expressions and Voice

This paper is devoted to tracking dynamics of psycho-emotional state based on analysis of the user’s facial video and voice. We propose a novel technology with personalized acoustic and visual lightweight neural network models that can be launched in real-time on any laptop or even mobile device. At first, two separate user-independent classifiers (feed-forward neural networks) are trained for speech emotion and facial expression recognition in video. The former extracts acoustic features with OpenL3 or OpenSmile frameworks. The latter is based on preliminary extraction of emotional features from each frame with a pre-trained convolutional neural network. Next, both classifiers are fine-tuned using a small number of short emotional videos that should be available for each user. The face of a user is identified during the real-time tracking of emotional state to choose the concrete neural networks. The final decision about current emotion in a short time frame is predicted by blending the outputs of personalized audio and video classifiers. It is experimentally demonstrated for the Russian Acted Multimodal Affective Set that the proposed approach makes it possible to increase the emotion recognition accuracy by 2–15%.
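The fusion step described in the abstract (blending the outputs of the personalized audio and video classifiers over a short time window) can be illustrated with a minimal sketch in Python. This is not the authors' implementation: the emotion label set, the frame-averaging of facial scores, the fixed blending weight, and the function name blend_window are illustrative assumptions.

```python
import numpy as np

# The emotion label set and its order are assumptions for illustration;
# the label set used in the paper may differ.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

def blend_window(audio_probs: np.ndarray,
                 video_frame_probs: np.ndarray,
                 audio_weight: float = 0.5) -> str:
    """Fuse audio and video emotion scores for one short time window.

    audio_probs:       class probabilities from the personalized speech-emotion classifier
                       for the audio segment of the window, shape (num_classes,).
    video_frame_probs: per-frame class probabilities from the personalized facial model,
                       shape (num_frames, num_classes).
    audio_weight:      blending weight; 0.5 is an assumed value, not taken from the paper.
    """
    video_probs = video_frame_probs.mean(axis=0)              # aggregate facial scores over frames
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * video_probs
    return EMOTIONS[int(np.argmax(fused))]

# Toy usage: ten video frames and one audio segment in a one-second window.
rng = np.random.default_rng(0)
video_scores = rng.dirichlet(np.ones(len(EMOTIONS)), size=10)   # stand-in facial CNN classifier outputs
audio_scores = rng.dirichlet(np.ones(len(EMOTIONS)))            # stand-in acoustic classifier output
print(blend_window(audio_scores, video_scores))
```

In the proposed system the blended models are chosen per user: the face is first identified, and the corresponding fine-tuned acoustic and visual classifiers are selected before their outputs are fused.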


Bibliographic Details
Published in: Pattern Recognition and Image Analysis, 2022-09, Vol. 32 (3), pp. 665–671 (Moscow: Pleiades Publishing)
Main authors: Savchenko, A. V., Savchenko, L. V.
Format: Article
Language: English
Online access: Full text
container_end_page 671
container_issue 3
container_start_page 665
container_title Pattern recognition and image analysis
container_volume 32
creator Savchenko, A. V.
Savchenko, L. V.
description This paper is devoted to tracking dynamics of psycho-emotional state based on analysis of the user’s facial video and voice. We propose a novel technology with personalized acoustic and visual lightweight neural network models that can be launched in real-time on any laptop or even mobile device. At first, two separate user-independent classifiers (feed-forward neural networks) are trained for speech emotion and facial expression recognition in video. The former extracts acoustic features with OpenL3 or OpenSmile frameworks. The latter is based on preliminary extraction of emotional features from each frame with a pre-trained convolutional neural network. Next, both classifiers are fine-tuned using a small number of short emotional videos that should be available for each user. The face of a user is identified during the real-time tracking of emotional state to choose the concrete neural networks. The final decision about current emotion in a short time frame is predicted by blending the outputs of personalized audio and video classifiers. It is experimentally demonstrated for the Russian Acted Multimodal Affective Set that the proposed approach makes it possible to increase the emotion recognition accuracy by 2–15%.
doi_str_mv 10.1134/S1054661822030397
format Article
fulltext fulltext
identifier ISSN: 1054-6618
ispartof Pattern recognition and image analysis, 2022-09, Vol.32 (3), p.665-671
issn 1054-6618
1555-6212
language eng
recordid cdi_proquest_journals_2726042356
source Springer Nature - Complete Springer Journals
subjects Artificial neural networks
Classifiers
Computer Science
Customization
Electronic devices
Emotion recognition
Emotional factors
Emotions
Face recognition
Feature extraction
Image Processing and Computer Vision
Neural networks
Pattern Recognition
Real time
SELECTED PAPERS OF THE 8th INTERNATIONAL WORKSHOP “IMAGE MINING. THEORY AND APPLICATIONS”
Speech recognition
Tracking
title Audio-Visual Continuous Recognition of Emotional State in a Multi-User System Based on Personalized Representation of Facial Expressions and Voice
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-18T22%3A08%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Audio-Visual%20Continuous%20Recognition%20of%20Emotional%20State%20in%20a%20Multi-User%20System%20Based%20on%20Personalized%20Representation%20of%20Facial%20Expressions%20and%20Voice&rft.jtitle=Pattern%20recognition%20and%20image%20analysis&rft.au=Savchenko,%20A.%20V.&rft.date=2022-09-01&rft.volume=32&rft.issue=3&rft.spage=665&rft.epage=671&rft.pages=665-671&rft.issn=1054-6618&rft.eissn=1555-6212&rft_id=info:doi/10.1134/S1054661822030397&rft_dat=%3Cproquest_cross%3E2726042356%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2726042356&rft_id=info:pmid/&rfr_iscdi=true