Audio-Visual Continuous Recognition of Emotional State in a Multi-User System Based on Personalized Representation of Facial Expressions and Voice

This paper is devoted to tracking dynamics of psycho-emotional state based on analysis of the user’s facial video and voice. We propose a novel technology with personalized acoustic and visual lightweight neural network models that can be launched in real-time on any laptop or even mobile device. At first, two separate user-independent classifiers (feed-forward neural networks) are trained for speech emotion and facial expression recognition in video. The former extracts acoustic features with OpenL3 or OpenSmile frameworks. The latter is based on preliminary extraction of emotional features from each frame with a pre-trained convolutional neural network. Next, both classifiers are fine-tuned using a small number of short emotional videos that should be available for each user. The face of a user is identified during the real-time tracking of emotional state to choose the concrete neural networks. The final decision about current emotion in a short time frame is predicted by blending the outputs of personalized audio and video classifiers. It is experimentally demonstrated for the Russian Acted Multimodal Affective Set that the proposed approach makes it possible to increase the emotion recognition accuracy by 2–15%.
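The fusion step described in the abstract (blending the outputs of the personalized audio and video classifiers over a short time window) can be illustrated with a minimal sketch in Python. This is not the authors' implementation: the emotion label set, the frame-averaging of facial scores, the fixed blending weight, and the function name blend_window are illustrative assumptions.

```python
import numpy as np

# The emotion label set and its order are assumptions for illustration;
# the label set used in the paper may differ.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

def blend_window(audio_probs: np.ndarray,
                 video_frame_probs: np.ndarray,
                 audio_weight: float = 0.5) -> str:
    """Fuse audio and video emotion scores for one short time window.

    audio_probs:       class probabilities from the personalized speech-emotion classifier
                       for the audio segment of the window, shape (num_classes,).
    video_frame_probs: per-frame class probabilities from the personalized facial model,
                       shape (num_frames, num_classes).
    audio_weight:      blending weight; 0.5 is an assumed value, not taken from the paper.
    """
    video_probs = video_frame_probs.mean(axis=0)              # aggregate facial scores over frames
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * video_probs
    return EMOTIONS[int(np.argmax(fused))]

# Toy usage: ten video frames and one audio segment in a one-second window.
rng = np.random.default_rng(0)
video_scores = rng.dirichlet(np.ones(len(EMOTIONS)), size=10)   # stand-in facial CNN classifier outputs
audio_scores = rng.dirichlet(np.ones(len(EMOTIONS)))            # stand-in acoustic classifier output
print(blend_window(audio_scores, video_scores))
```

In the proposed system the blended models are chosen per user: the face is first identified, and the corresponding fine-tuned acoustic and visual classifiers are selected before their outputs are fused.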


Bibliographic Details
Published in: Pattern Recognition and Image Analysis, 2022-09, Vol. 32 (3), pp. 665–671 (Moscow: Pleiades Publishing)
Main authors: Savchenko, A. V., Savchenko, L. V.
Format: Article
Language: English
Online access: Full text
container_end_page 671
container_issue 3
container_start_page 665
container_title Pattern recognition and image analysis
container_volume 32
creator Savchenko, A. V.
Savchenko, L. V.
description This paper is devoted to tracking dynamics of psycho-emotional state based on analysis of the user’s facial video and voice. We propose a novel technology with personalized acoustic and visual lightweight neural network models that can be launched in real-time on any laptop or even mobile device. At first, two separate user-independent classifiers (feed-forward neural networks) are trained for speech emotion and facial expression recognition in video. The former extracts acoustic features with OpenL3 or OpenSmile frameworks. The latter is based on preliminary extraction of emotional features from each frame with a pre-trained convolutional neural network. Next, both classifiers are fine-tuned using a small number of short emotional videos that should be available for each user. The face of a user is identified during the real-time tracking of emotional state to choose the concrete neural networks. The final decision about current emotion in a short time frame is predicted by blending the outputs of personalized audio and video classifiers. It is experimentally demonstrated for the Russian Acted Multimodal Affective Set that the proposed approach makes it possible to increase the emotion recognition accuracy by 2–15%.
doi_str_mv 10.1134/S1054661822030397
format Article
fulltext fulltext
identifier ISSN: 1054-6618
ispartof Pattern recognition and image analysis, 2022-09, Vol.32 (3), p.665-671
issn 1054-6618
1555-6212
language eng
recordid cdi_proquest_journals_2726042356
source Springer Nature - Complete Springer Journals
subjects Artificial neural networks
Classifiers
Computer Science
Customization
Electronic devices
Emotion recognition
Emotional factors
Emotions
Face recognition
Feature extraction
Image Processing and Computer Vision
Neural networks
Pattern Recognition
Real time
SELECTED PAPERS OF THE 8th INTERNATIONAL WORKSHOP “IMAGE MINING. THEORY AND APPLICATIONS”
Speech recognition
Tracking
title Audio-Visual Continuous Recognition of Emotional State in a Multi-User System Based on Personalized Representation of Facial Expressions and Voice
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-18T22%3A08%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Audio-Visual%20Continuous%20Recognition%20of%20Emotional%20State%20in%20a%20Multi-User%20System%20Based%20on%20Personalized%20Representation%20of%20Facial%20Expressions%20and%20Voice&rft.jtitle=Pattern%20recognition%20and%20image%20analysis&rft.au=Savchenko,%20A.%20V.&rft.date=2022-09-01&rft.volume=32&rft.issue=3&rft.spage=665&rft.epage=671&rft.pages=665-671&rft.issn=1054-6618&rft.eissn=1555-6212&rft_id=info:doi/10.1134/S1054661822030397&rft_dat=%3Cproquest_cross%3E2726042356%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2726042356&rft_id=info:pmid/&rfr_iscdi=true