GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions
creator | Zuo, Heda; You, Weitao; Wu, Junxian; Ren, Shihong; Chen, Pei; Zhou, Mingxu; Lu, Yujia; Sun, Lingyun |
description | Composing music for video is essential yet challenging, leading to a growing
interest in automating music generation for video applications. Existing
approaches often struggle to achieve robust music-video correspondence and
generative diversity, primarily due to inadequate feature alignment methods and
insufficient datasets. In this study, we present General Video-to-Music
Generation model (GVMGen), designed to generate music highly relevant to the
video input. Our model employs hierarchical attentions to extract and align
video features with music in both spatial and temporal dimensions, ensuring the
preservation of pertinent features while minimizing redundancy. Remarkably, our
method is versatile, capable of generating multi-style music from different
video inputs, even in zero-shot scenarios. We also propose an evaluation model
along with two novel objective metrics for assessing video-music alignment.
Additionally, we have compiled a large-scale dataset comprising diverse types
of video-music pairs. Experimental results demonstrate that GVMGen surpasses
previous models in terms of music-video correspondence, generative diversity,
and application universality. |
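The two-level alignment described in the abstract, spatial attention within each frame followed by temporal attention across frames, can be illustrated with a minimal NumPy sketch. This is a hypothetical toy, not the paper's actual architecture: the dimensions, the single learned pooling query, and the music-token queries are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (Nq, d) x (Nk, d) -> (Nq, d)
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

# Toy dimensions: T frames, P spatial patches per frame,
# d feature dim, M music-token queries (all made up for illustration).
rng = np.random.default_rng(0)
T, P, d, M = 8, 16, 32, 20
video = rng.standard_normal((T, P, d))  # per-frame patch features

# 1) Spatial attention: a learned query pools each frame's patches
#    into one frame-level embedding (keeps salient patches,
#    discards spatial redundancy).
spatial_q = rng.standard_normal((1, d))
frames = np.stack([attention(spatial_q, f, f)[0] for f in video])  # (T, d)

# 2) Temporal attention: music-token queries cross-attend over the
#    frame sequence, aligning the generated music with the video
#    over time.
music_q = rng.standard_normal((M, d))
aligned = attention(music_q, frames, frames)  # (M, d)

print(aligned.shape)  # -> (20, 32)
```

The point of the hierarchy is that the spatial stage compresses each frame before the temporal stage runs, so the cross-attention cost scales with the number of frames rather than the number of patches.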
doi_str_mv | 10.48550/arxiv.2501.09972 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2501.09972 |
language | eng |
recordid | cdi_arxiv_primary_2501_09972 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Multimedia; Computer Science - Sound |
title | GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-19T09%3A39%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=GVMGen:%20A%20General%20Video-to-Music%20Generation%20Model%20with%20Hierarchical%20Attentions&rft.au=Zuo,%20Heda&rft.date=2025-01-17&rft_id=info:doi/10.48550/arxiv.2501.09972&rft_dat=%3Carxiv_GOX%3E2501_09972%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |