MCF-Net: Fusion Network of Facial and Scene Features for Expression Recognition in the Wild

Bibliographic Details
Published in: Applied Sciences 2022-10, Vol. 12 (20), p. 10251
Authors: Xu, Hui; Kong, Jun; Kong, Xiangqin; Li, Juan; Wang, Jianzhong
Format: Article
Language: English
Online Access: Full text
Abstract: The facial expression recognition (FER) task has transitioned from laboratory-controlled scenarios to in-the-wild conditions. However, recognizing facial expressions in the wild is challenging due to factors such as varied backgrounds, low-quality facial images, and the subjectivity of annotators. Therefore, deep neural networks have increasingly been leveraged to learn discriminative representations for FER. In this work, we propose the Multi-cues Fusion Net (MCF-Net), a novel deep learning model with a two-stream structure for FER. Our model first uses a two-stream coding network to extract face and scene representations; an adaptive fusion module then fuses the two representations for final recognition. In the face coding stream, a Sparse Mask Attention Learning (SMAL) module adaptively generates a sparse face mask for the input image, and a Multi-scale Attention (MSA) module extracts fine-grained feature subsets that capture richer multi-scale interaction information. In the scene coding stream, a Relational Attention (RA) module constructs the hidden relationship between the face and the contextual features of non-facial regions by capturing their pairwise similarity. To verify the effectiveness and accuracy of our model, extensive experiments are conducted on two public large-scale static facial expression image datasets, CAER-S and NCAER-S. Compared with other methods, the proposed MCF-Net achieves superior results on both in-the-wild FER benchmarks: 81.82% accuracy on CAER-S and 45.59% on NCAER-S.
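The two-stream structure and adaptive fusion described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the published MCF-Net implementation: the encoder layers, feature dimension, and the softmax-gated form of the fusion module are hypothetical stand-ins, and the SMAL, MSA, and RA modules are omitted.

```python
# Minimal sketch of a two-stream FER model with adaptive fusion.
# Encoder layers, feature sizes, and the gating-style fusion are
# illustrative assumptions, not the authors' MCF-Net code.
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """Weights the face and scene representations per sample and sums them."""

    def __init__(self, dim: int):
        super().__init__()
        # Produces one score per stream from the concatenated features.
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, face: torch.Tensor, scene: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(torch.cat([face, scene], dim=1)), dim=1)
        return w[:, 0:1] * face + w[:, 1:2] * scene  # broadcast over feature dim


def make_encoder(dim: int) -> nn.Sequential:
    # Tiny stand-in encoder; MCF-Net's face stream uses SMAL/MSA modules
    # and its scene stream uses an RA module, all omitted here.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
    )


class TwoStreamFER(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 7):
        super().__init__()
        self.face_stream = make_encoder(dim)
        self.scene_stream = make_encoder(dim)
        self.fusion = AdaptiveFusion(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, face_img: torch.Tensor, scene_img: torch.Tensor) -> torch.Tensor:
        face = self.face_stream(face_img)     # cropped face region
        scene = self.scene_stream(scene_img)  # full image for scene context
        return self.classifier(self.fusion(face, scene))


# A cropped face and the full scene image are encoded separately, fused
# adaptively, then classified into 7 basic expression categories.
model = TwoStreamFER()
logits = model(torch.randn(2, 3, 112, 112), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 7])
```

A softmax gate over the concatenated representations lets the network weight face and scene cues differently for each sample, which is one common way to realize an "adaptive fusion" of two streams.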
ISSN: 2076-3417
DOI: 10.3390/app122010251