An Efficient Bi-modal Fusion Framework for Music Emotion Recognition
Published in: IEEE Transactions on Affective Computing, 2024-10, pp. 1-17
Format: Article
Language: English
Summary: Current methods for Music Emotion Recognition (MER) face challenges in effectively extracting emotion-sensitive features, especially those rich in temporal detail. Moreover, the narrow scope of music-related modalities impedes data integration from multiple sources, while including multiple modalities often introduces redundant information that can degrade performance. To address these issues, we propose a lightweight framework for music emotion recognition that improves the extraction of features that are both emotion-sensitive and rich in temporal information, and that integrates data from the audio and MIDI modalities while minimizing redundancy. Our approach develops two novel unimodal encoders to learn embeddings from audio and MIDI-like features. Additionally, we introduce a Bi-modal Fusion Attention Model (BFAM) that integrates features across modalities from low-level to high-level semantic information. Experimental evaluations on the EMOPIA and VGMIDI datasets show that our unimodal networks achieve accuracies 6.1% and 4.4% higher than the baseline algorithms for MIDI and audio on the EMOPIA dataset, respectively. Furthermore, our BFAM achieves a 15.2% improvement in accuracy over the baseline, reaching 82.2%, underscoring its effectiveness for bi-modal MER applications.
ISSN: 1949-3045
DOI: 10.1109/TAFFC.2024.3486340
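For illustration only: the summary above describes fusing audio and MIDI embeddings through an attention-based model (BFAM). The minimal PyTorch sketch below shows one generic way such bi-modal cross-attention fusion can be wired up; the class name CrossModalFusion, the embedding size, the gated combination, and the 4-class emotion head are assumptions made here for demonstration and are not the paper's actual BFAM architecture.

```python
# Illustrative sketch only: a generic bi-modal cross-attention fusion block,
# NOT the paper's BFAM. Names, dimensions, and the gating strategy are
# assumptions that mirror the abstract's idea of combining audio and MIDI
# embeddings while limiting redundant information.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses an audio embedding sequence with a MIDI embedding sequence
    using symmetric cross-attention followed by a gated combination."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 4):
        super().__init__()
        # Audio queries attend over MIDI keys/values, and vice versa.
        self.audio_to_midi = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.midi_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # A sigmoid gate decides, per feature, how much of each modality to keep,
        # which is one simple way to suppress redundant information.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)  # e.g. 4 emotion quadrants

    def forward(self, audio: torch.Tensor, midi: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_audio, dim), midi: (batch, T_midi, dim)
        a_ctx, _ = self.audio_to_midi(audio, midi, midi)   # audio enriched by MIDI
        m_ctx, _ = self.midi_to_audio(midi, audio, audio)  # MIDI enriched by audio
        # Pool each enriched sequence over time, then gate the two summaries.
        a_vec, m_vec = a_ctx.mean(dim=1), m_ctx.mean(dim=1)
        g = self.gate(torch.cat([a_vec, m_vec], dim=-1))
        fused = g * a_vec + (1.0 - g) * m_vec
        return self.classifier(fused)


if __name__ == "__main__":
    model = CrossModalFusion()
    audio_emb = torch.randn(2, 100, 256)  # dummy audio encoder output
    midi_emb = torch.randn(2, 60, 256)    # dummy MIDI encoder output
    print(model(audio_emb, midi_emb).shape)  # torch.Size([2, 4])
```

The gated sum is only one plausible choice for reducing cross-modal redundancy; the paper's reported low-level-to-high-level fusion would operate on intermediate encoder features rather than on pooled summaries as done in this sketch.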