Deep learning-based late fusion of multimodal information for emotion classification of music video
Saved in:
Published in: | Multimedia tools and applications 2021-01, Vol.80 (2), p.2887-2905 |
---|---|
Main authors: | Pandeya, Yagya Raj; Lee, Joonwhoan |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 2905 |
---|---|
container_issue | 2 |
container_start_page | 2887 |
container_title | Multimedia tools and applications |
container_volume | 80 |
creator | Pandeya, Yagya Raj; Lee, Joonwhoan |
description | Affective computing is an emerging research area that aims to enable intelligent systems to recognize, feel, infer, and interpret human emotions. Widely available online and offline music videos are a rich source for human emotion analysis because they integrate the composer’s internal feelings through song lyrics, musical instrument performance, and visual expression. In general, the metadata that music video customers use to choose a product includes high-level semantics such as emotion, so automatic emotion analysis may be necessary. In this research area, however, the lack of labeled datasets is a major problem. Therefore, we first construct a balanced music video emotion dataset with diversity of territory, language, culture, and musical instruments. We test this dataset on four unimodal and four multimodal convolutional neural networks (CNNs) for music and video. First, we separately fine-tune each pre-trained unimodal CNN and test its performance on unseen data. In addition, we train a 1-dimensional CNN-based music emotion classifier with raw waveform input. A comparative analysis of each unimodal classifier over various optimizers is performed to find the best model to integrate into a multimodal structure. The best unimodal modality is integrated with the corresponding music and video network features to form a multimodal classifier. The multimodal structure integrates the complete music video features and makes the final classification with a SoftMax classifier using a late feature fusion strategy. All possible multimodal structures are also combined into one predictive model to obtain an overall prediction. All the proposed multimodal structures use cross-validation at the decision level to mitigate the data scarcity (overfitting) problem. Evaluation with various metrics shows a boost in the performance of the multimodal architectures compared to each unimodal emotion classifier. The predictive model integrating all multimodal structures achieves 88.56% accuracy, an F1-score of 0.88, and an area under the curve (AUC) score of 0.987. These results suggest that high-level human emotions are classified well by the proposed CNN-based multimodal networks, even though only a small amount of labeled data is available for training. (A minimal, hypothetical late-fusion sketch is given after the record fields below.) |
doi_str_mv | 10.1007/s11042-020-08836-3 |
format | Article |
publisher | New York: Springer US |
rights | The Author(s) 2020 |
fulltext | fulltext |
identifier | ISSN: 1380-7501 |
ispartof | Multimedia tools and applications, 2021-01, Vol.80 (2), p.2887-2905 |
issn | 1380-7501 1573-7721 |
language | eng |
recordid | cdi_proquest_journals_2478170513 |
source | Springer journals |
subjects | Affective computing Artificial neural networks Classification Classifiers Computer Communication Networks Computer Science Data Structures and Information Theory Datasets Deep learning Emotions Multimedia Information Systems Music videos Musical instruments Prediction models Predictions Semantics Special Purpose and Application-Based Systems Video data Waveforms |
title | Deep learning-based late fusion of multimodal information for emotion classification of music video |
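The abstract describes a late feature fusion design: features from separately trained unimodal music and video CNNs are concatenated and a SoftMax classifier produces the final emotion prediction, with a 1-dimensional CNN handling raw-waveform music input. The sketch below is only a minimal PyTorch illustration of that idea, not the authors' implementation; the layer sizes, the 16 kHz one-second waveform, the 512-dimensional pre-extracted video feature, and the six emotion classes are all assumptions made for this sketch.

```python
# Hypothetical illustration only -- not the authors' code. Layer sizes, the
# 16 kHz input length, the 512-d video feature, and the six emotion classes
# are assumptions made for this sketch.
import torch
import torch.nn as nn


class AudioBranch(nn.Module):
    """Toy 1-D CNN over raw waveform, standing in for the paper's music network."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=16), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global average pooling -> fixed-size vector
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, num_samples)
        return self.proj(self.conv(waveform).squeeze(-1))


class LateFusionClassifier(nn.Module):
    """Concatenate per-modality features and classify with a softmax head."""

    def __init__(self, audio_dim: int = 128, video_dim: int = 512, num_classes: int = 6):
        super().__init__()
        self.audio_branch = AudioBranch(audio_dim)
        # The video branch is assumed to be a separately pre-trained CNN whose
        # pooled clip feature (video_dim values) is passed in directly.
        self.head = nn.Sequential(
            nn.Linear(audio_dim + video_dim, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),  # logits; softmax is applied at inference time
        )

    def forward(self, waveform: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.audio_branch(waveform), video_feat], dim=1)  # late feature fusion
        return self.head(fused)


if __name__ == "__main__":
    model = LateFusionClassifier()
    wav = torch.randn(2, 1, 16000)   # two clips of raw waveform (assumed 1 s at 16 kHz)
    vid = torch.randn(2, 512)        # pre-extracted video features (assumed dimensionality)
    probs = torch.softmax(model(wav, vid), dim=1)
    print(probs.shape)               # torch.Size([2, 6])
```

Late fusion keeps the two branches independent until the classification head, which is what allows each pre-trained unimodal network to be fine-tuned separately before the fused classifier is trained; the paper's additional decision-level combination of all multimodal structures with cross-validation is not shown here.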