GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions

Composing music for video is essential yet challenging, leading to a growing interest in automating music generation for video applications. Existing approaches often struggle to achieve robust music-video correspondence and generative diversity, primarily due to inadequate feature alignment methods and insufficient datasets. In this study, we present the General Video-to-Music Generation model (GVMGen), designed to generate music that is highly related to the video input. Our model employs hierarchical attentions to extract and align video features with music in both the spatial and temporal dimensions, ensuring the preservation of pertinent features while minimizing redundancy. Remarkably, our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios. We also propose an evaluation model along with two novel objective metrics for assessing video-music alignment. Additionally, we have compiled a large-scale dataset comprising diverse types of video-music pairs. Experimental results demonstrate that GVMGen surpasses previous models in terms of music-video correspondence, generative diversity, and application universality.
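The abstract's "hierarchical attentions in both spatial and temporal dimensions" can be read as a two-stage scheme: attention first pools patch features within each frame, then self-attention mixes the pooled features across frames. The record does not describe GVMGen's actual architecture, so the following NumPy sketch is purely illustrative; all shapes and the single-query spatial pooling are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention over the last two axes.
    d = q.shape[-1]
    weights = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d))
    return weights @ v

rng = np.random.default_rng(0)
frames, patches, dim = 8, 16, 32
video = rng.standard_normal((frames, patches, dim))  # stand-in frame/patch features

# Spatial stage: one (here random, normally learned) query per frame
# pools the patch features of that frame into a single vector.
spatial_q = rng.standard_normal((frames, 1, dim))
frame_feats = attend(spatial_q, video, video).squeeze(1)  # (frames, dim)

# Temporal stage: self-attention mixes the pooled frame features,
# producing a sequence a music decoder could condition on.
aligned = attend(frame_feats, frame_feats, frame_feats)   # (frames, dim)
print(aligned.shape)
```

The point of the hierarchy is dimensionality reduction with relevance preserved: the spatial stage discards redundant within-frame detail before the temporal stage models dynamics across frames.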

Bibliographic details
Main authors: Zuo, Heda; You, Weitao; Wu, Junxian; Ren, Shihong; Chen, Pei; Zhou, Mingxu; Lu, Yujia; Sun, Lingyun
Format: Article
Language: English
Online access: Order full text
DOI: 10.48550/arxiv.2501.09972
Source: arXiv.org
Subjects: Computer Science - Artificial Intelligence; Computer Science - Multimedia; Computer Science - Sound