Audio-Visual Continuous Recognition of Emotional State in a Multi-User System Based on Personalized Representation of Facial Expressions and Voice
This paper is devoted to tracking the dynamics of a user's psycho-emotional state based on analysis of facial video and voice. We propose a novel technology with personalized acoustic and visual lightweight neural network models that can run in real time on any laptop or even a mobile device. First, two separate user-independent classifiers (feed-forward neural networks) are trained for speech emotion recognition and facial expression recognition in video. The former extracts acoustic features with the OpenL3 or openSMILE frameworks. The latter relies on preliminary extraction of emotional features from each frame with a pre-trained convolutional neural network. Next, both classifiers are fine-tuned using a small number of short emotional videos that should be available for each user. The face of the user is identified during real-time tracking of the emotional state to select the corresponding personalized neural networks. The final decision about the current emotion in a short time frame is predicted by blending the outputs of the personalized audio and video classifiers. It is experimentally demonstrated on the Russian Acted Multimodal Affective Set that the proposed approach increases emotion recognition accuracy by 2–15%.
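To make the final fusion step concrete, the following is a minimal sketch of the decision-level blending described in the abstract: the per-utterance output of a personalized audio classifier is combined with the frame-level outputs of a personalized facial-expression classifier to predict the emotion for a short time window. The emotion label set, the equal blending weight, and the frame-averaging scheme are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of late fusion (blending) of personalized audio and video emotion classifiers.
# Assumed label set and weighting; the probability vectors would come from the
# fine-tuned speech and facial-expression models described in the abstract.
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness"]  # assumed label set


def blend_predictions(audio_probs: np.ndarray,
                      frame_probs: np.ndarray,
                      audio_weight: float = 0.5) -> str:
    """Blend per-utterance audio probabilities with per-frame video probabilities.

    audio_probs: shape (num_emotions,), output of the personalized speech model.
    frame_probs: shape (num_frames, num_emotions), per-frame outputs of the
                 personalized facial-expression model for the same time window.
    """
    video_probs = frame_probs.mean(axis=0)                       # aggregate frame-level scores
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * video_probs
    return EMOTIONS[int(np.argmax(fused))]


# Example with random stand-in scores for a one-second window of 25 frames.
rng = np.random.default_rng(0)
audio = rng.dirichlet(np.ones(len(EMOTIONS)))
video = rng.dirichlet(np.ones(len(EMOTIONS)), size=25)
print(blend_predictions(audio, video))
```

In practice the blending weight could itself be tuned per user on the same short calibration videos used for fine-tuning the two classifiers.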
Saved in:
Published in: | Pattern recognition and image analysis, 2022-09, Vol. 32 (3), pp. 665–671 |
---|---|
Main authors: | Savchenko, A. V.; Savchenko, L. V. |
Format: | Article |
Language: | English |
Subjects: | Artificial neural networks; Classifiers; Computer Science; Customization; Electronic devices; Emotion recognition; Emotional factors; Emotions; Face recognition; Feature extraction; Image Processing and Computer Vision; Neural networks; Pattern Recognition; Real time; Speech recognition; Tracking |
Online access: | Full text |
container_end_page | 671 |
container_issue | 3 |
container_start_page | 665 |
container_title | Pattern recognition and image analysis |
container_volume | 32 |
creator | Savchenko, A. V.; Savchenko, L. V. |
description | This paper is devoted to tracking dynamics of psycho-emotional state based on analysis of the user’s facial video and voice. We propose a novel technology with personalized acoustic and visual lightweight neural network models that can be launched in real-time on any laptop or even mobile device. At first, two separate user-independent classifiers (feed-forward neural networks) are trained for speech emotion and facial expression recognition in video. The former extracts acoustic features with OpenL3 or OpenSmile frameworks. The latter is based on preliminary extraction of emotional features from each frame with a pre-trained convolutional neural network. Next, both classifiers are fine-tuned using a small number of short emotional videos that should be available for each user. The face of a user is identified during the real-time tracking of emotional state to choose the concrete neural networks. The final decision about current emotion in a short time frame is predicted by blending the outputs of personalized audio and video classifiers. It is experimentally demonstrated for the Russian Acted Multimodal Affective Set that the proposed approach makes it possible to increase the emotion recognition accuracy by 2–15%. |
doi_str_mv | 10.1134/S1054661822030397 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2726042356</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2726042356</sourcerecordid><originalsourceid>FETCH-LOGICAL-c316t-9d1d1958ccae3bafb6bd3e50a0ade075ccf441dc35972d1a5a3581b9c656df483</originalsourceid><addsrcrecordid>eNp1kM1KAzEURgdRsFYfwF3A9Wh-JunMspZWhYrS2m6HNLlTUtpJzZ0B62P4xGao4EJc5SbfOR_hJsk1o7eMiexuzqjMlGI551RQUQxOkh6TUqaKM34a5xinXX6eXCBuKKU5K3gv-Rq21vl06bDVWzLydePq1rdIZmD8unaN8zXxFRnvfDdGZt7oBoiriSbP7bZx6QIhkPkBG9iRe41gSVReIWCHu894n8E-AEIdzZ-6iTYudo0_ugDjIxJdW7L0zsBlclbpLcLVz9lPFpPx2-gxnb48PI2G09QIppq0sMyyQubGaBArXa3UygqQVFNtgQ6kMVWWMWuELAbcMi21kDlbFUZJZassF_3k5ti7D_69BWzKjW9D_DOWfMAVzbiQKlLsSJngEQNU5T64nQ6HktGyW335Z_XR4UcHI1uvIfw2_y99A7qtiCs</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2726042356</pqid></control><display><type>article</type><title>Audio-Visual Continuous Recognition of Emotional State in a Multi-User System Based on Personalized Representation of Facial Expressions and Voice</title><source>Springer Nature - Complete Springer Journals</source><creator>Savchenko, A. V. ; Savchenko, L. V.</creator><creatorcontrib>Savchenko, A. V. ; Savchenko, L. V.</creatorcontrib><description>This paper is devoted to tracking dynamics of psycho-emotional state based on analysis of the user’s facial video and voice. We propose a novel technology with personalized acoustic and visual lightweight neural network models that can be launched in real-time on any laptop or even mobile device. At first, two separate user-independent classifiers (feed-forward neural networks) are trained for speech emotion and facial expression recognition in video. The former extracts acoustic features with OpenL3 or OpenSmile frameworks. The latter is based on preliminary extraction of emotional features from each frame with a pre-trained convolutional neural network. Next, both classifiers are fine-tuned using a small number of short emotional videos that should be available for each user. The face of a user is identified during the real-time tracking of emotional state to choose the concrete neural networks. The final decision about current emotion in a short time frame is predicted by blending the outputs of personalized audio and video classifiers. It is experimentally demonstrated for the Russian Acted Multimodal Affective Set that the proposed approach makes it possible to increase the emotion recognition accuracy by 2–15%.</description><identifier>ISSN: 1054-6618</identifier><identifier>EISSN: 1555-6212</identifier><identifier>DOI: 10.1134/S1054661822030397</identifier><language>eng</language><publisher>Moscow: Pleiades Publishing</publisher><subject>Artificial neural networks ; Classifiers ; Computer Science ; Customization ; Electronic devices ; Emotion recognition ; Emotional factors ; Emotions ; Face recognition ; Feature extraction ; Image Processing and Computer Vision ; Neural networks ; Pattern Recognition ; Real time ; SELECTED PAPERS OF THE 8th INTERNATIONAL WORKSHOP “IMAGE MINING. THEORY AND APPLICATIONS” ; Speech recognition ; Tracking</subject><ispartof>Pattern recognition and image analysis, 2022-09, Vol.32 (3), p.665-671</ispartof><rights>Pleiades Publishing, Ltd. 2022. ISSN 1054-6618, Pattern Recognition and Image Analysis, 2022, Vol. 32, No. 3, pp. 665–671. 
© Pleiades Publishing, Ltd., 2022.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c316t-9d1d1958ccae3bafb6bd3e50a0ade075ccf441dc35972d1a5a3581b9c656df483</citedby><cites>FETCH-LOGICAL-c316t-9d1d1958ccae3bafb6bd3e50a0ade075ccf441dc35972d1a5a3581b9c656df483</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1134/S1054661822030397$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1134/S1054661822030397$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>314,776,780,27901,27902,41464,42533,51294</link.rule.ids></links><search><creatorcontrib>Savchenko, A. V.</creatorcontrib><creatorcontrib>Savchenko, L. V.</creatorcontrib><title>Audio-Visual Continuous Recognition of Emotional State in a Multi-User System Based on Personalized Representation of Facial Expressions and Voice</title><title>Pattern recognition and image analysis</title><addtitle>Pattern Recognit. Image Anal</addtitle><description>This paper is devoted to tracking dynamics of psycho-emotional state based on analysis of the user’s facial video and voice. We propose a novel technology with personalized acoustic and visual lightweight neural network models that can be launched in real-time on any laptop or even mobile device. At first, two separate user-independent classifiers (feed-forward neural networks) are trained for speech emotion and facial expression recognition in video. The former extracts acoustic features with OpenL3 or OpenSmile frameworks. The latter is based on preliminary extraction of emotional features from each frame with a pre-trained convolutional neural network. Next, both classifiers are fine-tuned using a small number of short emotional videos that should be available for each user. The face of a user is identified during the real-time tracking of emotional state to choose the concrete neural networks. The final decision about current emotion in a short time frame is predicted by blending the outputs of personalized audio and video classifiers. It is experimentally demonstrated for the Russian Acted Multimodal Affective Set that the proposed approach makes it possible to increase the emotion recognition accuracy by 2–15%.</description><subject>Artificial neural networks</subject><subject>Classifiers</subject><subject>Computer Science</subject><subject>Customization</subject><subject>Electronic devices</subject><subject>Emotion recognition</subject><subject>Emotional factors</subject><subject>Emotions</subject><subject>Face recognition</subject><subject>Feature extraction</subject><subject>Image Processing and Computer Vision</subject><subject>Neural networks</subject><subject>Pattern Recognition</subject><subject>Real time</subject><subject>SELECTED PAPERS OF THE 8th INTERNATIONAL WORKSHOP “IMAGE MINING. 
THEORY AND APPLICATIONS”</subject><subject>Speech recognition</subject><subject>Tracking</subject><issn>1054-6618</issn><issn>1555-6212</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><recordid>eNp1kM1KAzEURgdRsFYfwF3A9Wh-JunMspZWhYrS2m6HNLlTUtpJzZ0B62P4xGao4EJc5SbfOR_hJsk1o7eMiexuzqjMlGI551RQUQxOkh6TUqaKM34a5xinXX6eXCBuKKU5K3gv-Rq21vl06bDVWzLydePq1rdIZmD8unaN8zXxFRnvfDdGZt7oBoiriSbP7bZx6QIhkPkBG9iRe41gSVReIWCHu894n8E-AEIdzZ-6iTYudo0_ugDjIxJdW7L0zsBlclbpLcLVz9lPFpPx2-gxnb48PI2G09QIppq0sMyyQubGaBArXa3UygqQVFNtgQ6kMVWWMWuELAbcMi21kDlbFUZJZassF_3k5ti7D_69BWzKjW9D_DOWfMAVzbiQKlLsSJngEQNU5T64nQ6HktGyW335Z_XR4UcHI1uvIfw2_y99A7qtiCs</recordid><startdate>20220901</startdate><enddate>20220901</enddate><creator>Savchenko, A. V.</creator><creator>Savchenko, L. V.</creator><general>Pleiades Publishing</general><general>Springer Nature B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20220901</creationdate><title>Audio-Visual Continuous Recognition of Emotional State in a Multi-User System Based on Personalized Representation of Facial Expressions and Voice</title><author>Savchenko, A. V. ; Savchenko, L. V.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c316t-9d1d1958ccae3bafb6bd3e50a0ade075ccf441dc35972d1a5a3581b9c656df483</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Artificial neural networks</topic><topic>Classifiers</topic><topic>Computer Science</topic><topic>Customization</topic><topic>Electronic devices</topic><topic>Emotion recognition</topic><topic>Emotional factors</topic><topic>Emotions</topic><topic>Face recognition</topic><topic>Feature extraction</topic><topic>Image Processing and Computer Vision</topic><topic>Neural networks</topic><topic>Pattern Recognition</topic><topic>Real time</topic><topic>SELECTED PAPERS OF THE 8th INTERNATIONAL WORKSHOP “IMAGE MINING. THEORY AND APPLICATIONS”</topic><topic>Speech recognition</topic><topic>Tracking</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Savchenko, A. V.</creatorcontrib><creatorcontrib>Savchenko, L. V.</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Pattern recognition and image analysis</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Savchenko, A. V.</au><au>Savchenko, L. V.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Audio-Visual Continuous Recognition of Emotional State in a Multi-User System Based on Personalized Representation of Facial Expressions and Voice</atitle><jtitle>Pattern recognition and image analysis</jtitle><stitle>Pattern Recognit. 
Image Anal</stitle><date>2022-09-01</date><risdate>2022</risdate><volume>32</volume><issue>3</issue><spage>665</spage><epage>671</epage><pages>665-671</pages><issn>1054-6618</issn><eissn>1555-6212</eissn><abstract>This paper is devoted to tracking dynamics of psycho-emotional state based on analysis of the user’s facial video and voice. We propose a novel technology with personalized acoustic and visual lightweight neural network models that can be launched in real-time on any laptop or even mobile device. At first, two separate user-independent classifiers (feed-forward neural networks) are trained for speech emotion and facial expression recognition in video. The former extracts acoustic features with OpenL3 or OpenSmile frameworks. The latter is based on preliminary extraction of emotional features from each frame with a pre-trained convolutional neural network. Next, both classifiers are fine-tuned using a small number of short emotional videos that should be available for each user. The face of a user is identified during the real-time tracking of emotional state to choose the concrete neural networks. The final decision about current emotion in a short time frame is predicted by blending the outputs of personalized audio and video classifiers. It is experimentally demonstrated for the Russian Acted Multimodal Affective Set that the proposed approach makes it possible to increase the emotion recognition accuracy by 2–15%.</abstract><cop>Moscow</cop><pub>Pleiades Publishing</pub><doi>10.1134/S1054661822030397</doi><tpages>7</tpages></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1054-6618 |
ispartof | Pattern recognition and image analysis, 2022-09, Vol.32 (3), p.665-671 |
issn | 1054-6618 1555-6212 |
language | eng |
recordid | cdi_proquest_journals_2726042356 |
source | Springer Nature - Complete Springer Journals |
subjects | Artificial neural networks Classifiers Computer Science Customization Electronic devices Emotion recognition Emotional factors Emotions Face recognition Feature extraction Image Processing and Computer Vision Neural networks Pattern Recognition Real time SELECTED PAPERS OF THE 8th INTERNATIONAL WORKSHOP “IMAGE MINING. THEORY AND APPLICATIONS” Speech recognition Tracking |
title | Audio-Visual Continuous Recognition of Emotional State in a Multi-User System Based on Personalized Representation of Facial Expressions and Voice |