Exploring modality-agnostic representations for music classification


Bibliographic Details
Published in: arXiv.org 2021-06
Main Authors: Ho-Hsiang, Wu; Fuentes, Magdalena; Bello, Juan P
Format: Article
Language: English
Subjects: Acoustics; Audio data; Audio equipment; Classifiers; Image classification; Information retrieval; Music; Musical instruments; Representations
Online Access: Full text
container_title arXiv.org
creator Ho-Hsiang, Wu
Fuentes, Magdalena
Bello, Juan P
description Music information is often conveyed or recorded across multiple data modalities, including but not limited to audio, images, text and scores. However, music information retrieval research has almost exclusively focused on single-modality recognition, requiring the development of separate models for each modality. Some multi-modal works require multiple coexisting modalities to be given to the model as inputs, constraining the use of these models to the few cases where data from all modalities are available. To the best of our knowledge, no existing model has the ability to take inputs from varying modalities, e.g. images or sounds, and classify them into unified music categories. We explore the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality. We select instrument classification as an example task for our study, as both visual and audio components provide relevant semantic information. We train music instrument classifiers that can take either images or sounds as input and perform comparably to sound-only or image-only classifiers. Furthermore, we explore the case where there is limited labeled data for a given modality and the impact on performance of using labeled data from other modalities. We are able to achieve almost 70% of the performance of the best performing system in a zero-shot setting. We provide a detailed analysis of the experimental results to understand the potential and limitations of the approach, and discuss future steps towards modality-agnostic classifiers.
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2021-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_2536671503
source Freely Accessible Journals
subjects Acoustics
Audio data
Audio equipment
Classifiers
Image classification
Information retrieval
Music
Musical instruments
Representations
title Exploring modality-agnostic representations for music classification
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T06%3A08%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Exploring%20modality-agnostic%20representations%20for%20music%20classification&rft.jtitle=arXiv.org&rft.au=Ho-Hsiang,%20Wu&rft.date=2021-06-02&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2536671503%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2536671503&rft_id=info:pmid/&rfr_iscdi=true
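To make the approach summarized in the description field above more concrete, the following is a minimal, illustrative sketch of cross-modal retrieval as a pretext task for modality-agnostic instrument classification. It is not the authors' implementation: the encoder architectures, embedding size, temperature, class count, and input shapes are all assumptions, and a symmetric InfoNCE loss stands in here for whatever retrieval objective the paper actually uses.

```python
# Illustrative sketch (assumed, not the authors' code): learn a shared embedding
# space for audio and image inputs via a cross-modal contrastive (retrieval) loss,
# then train a single instrument classifier on the modality-agnostic embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Maps log-mel spectrograms (B, 1, 128, T) to a d-dimensional embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)


class ImageEncoder(nn.Module):
    """Maps RGB images (B, 3, H, W) into the same d-dimensional embedding space."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)


def cross_modal_retrieval_loss(z_audio, z_image, temperature=0.07):
    """Symmetric InfoNCE: matching audio/image pairs in the batch are positives."""
    logits = z_audio @ z_image.t() / temperature
    targets = torch.arange(z_audio.size(0), device=z_audio.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# After the pretext stage, a single classifier consumes embeddings from either
# modality, so labels from one modality can stand in for the other.
classifier = nn.Linear(128, 10)  # assumed: 10 instrument classes

audio_enc, image_enc = AudioEncoder(), ImageEncoder()
spec = torch.randn(8, 1, 128, 256)   # dummy batch of spectrograms
imgs = torch.randn(8, 3, 224, 224)   # dummy batch of matching images
pretext_loss = cross_modal_retrieval_loss(audio_enc(spec), image_enc(imgs))

labels = torch.randint(0, 10, (8,))
# Train the classifier on image embeddings only ...
cls_loss = F.cross_entropy(classifier(image_enc(imgs)), labels)
# ... and evaluate it zero-shot on audio embeddings of the same classes.
audio_logits = classifier(audio_enc(spec))
```

Training the classifier on embeddings from one modality and evaluating it on the other mirrors the zero-shot setting described in the abstract, where labeled data from a different modality substitutes for missing labels in the target modality.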