A joint hierarchical cross‐attention graph convolutional network for multi‐modal facial expression recognition

Emotional recognition in conversations (ERC) is increasingly being applied in various IoT devices. Deep learning‐based multimodal ERC has achieved great success by leveraging diverse and complementary modalities. Although most existing methods try to adopt attention mechanisms to fuse different info...

Detailed description

Bibliographic details
Published in:	Computational intelligence 2024-02, Vol.40 (1), p.n/a
Main authors:	Xu, Chujie; Du, Yong; Wang, Jingzi; Zheng, Wenjie; Li, Tiejun; Yuan, Zhansheng
Format:	Article
Language:	eng
Subjects:	Artificial neural networks; Audio data; Coders; Context; Convolution; cross‐attention mechanism; Deep learning; emotional recognition in conversations; Face recognition; Feature extraction; graph convolution network; IoT; multi‐modal fusion; transformer; Transformers
Online access:	Full text

container_end_page	n/a
container_issue	1
container_start_page
container_title	Computational intelligence
container_volume	40
creator	Xu, Chujie; Du, Yong; Wang, Jingzi; Zheng, Wenjie; Li, Tiejun; Yuan, Zhansheng
description	Emotional recognition in conversations (ERC) is increasingly being applied in various IoT devices. Deep learning‐based multimodal ERC has achieved great success by leveraging diverse and complementary modalities. Although most existing methods try to adopt attention mechanisms to fuse different information, these methods ignore the complementarity between modalities. To this end, the joint cross‐attention model is introduced to alleviate this issue. However, multi‐scale feature information on different modalities is not utilized. Moreover, the context relationship plays an important role in feature extraction in the expression recognition task. In this paper, we propose a novel joint hierarchical graph convolution network (JHGCN) which exploits different layer features and context relationships for facial expression recognition based on audio‐visual (A‐V) information. Specifically, we adopt different deep networks to extract features from different modalities individually. For V modality, we construct V graph data based on patch embeddings which are extracted from the transformer encoder. Moreover, we embed the graph convolution which can leverage the intra‐modality relationships with the transformer encoder. Then, the deep feature from different layers is fed to the hierarchical fusion module to enhance feature representation. At last, we use the joint cross‐attention mechanism to exploit the complementary inter‐modality relationships. To validate the proposed model, we have conducted various experiments on the AffWild2 and CMU‐MOSI datasets. All results confirm that our proposed model achieves highly promising performance compared to the joint cross‐attention model and other methods.
doi_str_mv	10.1111/coin.12607
format	Article
fulltext	fulltext
identifier	ISSN: 0824-7935
ispartof	Computational intelligence, 2024-02, Vol.40 (1), p.n/a
issn	0824-7935 1467-8640
language	eng
recordid	cdi_proquest_journals_2930966808
source	Wiley Online Library Journals Frontfile Complete; Business Source Complete
subjects	Artificial neural networks; Audio data; Coders; Context; Convolution; cross‐attention mechanism; Deep learning; emotional recognition in conversations; Face recognition; Feature extraction; graph convolution network; IoT; multi‐modal fusion; transformer; Transformers
title	A joint hierarchical cross‐attention graph convolutional network for multi‐modal facial expression recognition
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-15T03%3A59%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20joint%20hierarchical%20cross%E2%80%90attention%20graph%20convolutional%20network%20for%20multi%E2%80%90modal%20facial%20expression%20recognition&rft.jtitle=Computational%20intelligence&rft.au=Xu,%20Chujie&rft.date=2024-02&rft.volume=40&rft.issue=1&rft.epage=n/a&rft.issn=0824-7935&rft.eissn=1467-8640&rft_id=info:doi/10.1111/coin.12607&rft_dat=%3Cproquest_cross%3E2930966808%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2930966808&rft_id=info:pmid/&rfr_iscdi=true