VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 360K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video.
Saved in:
Main authors: | Tian, Zeyue; Liu, Zhaoyang; Yuan, Ruibin; Pan, Jiahao; Liu, Qifeng; Tan, Xu; Chen, Qifeng; Xue, Wei; Guo, Yike |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Tian, Zeyue; Liu, Zhaoyang; Yuan, Ruibin; Pan, Jiahao; Liu, Qifeng; Tan, Xu; Chen, Qifeng; Xue, Wei; Guo, Yike |
description | In this work, we systematically study music generation conditioned solely on
the video. First, we present a large-scale dataset comprising 360K video-music
pairs, including various genres such as movie trailers, advertisements, and
documentaries. Furthermore, we propose VidMuse, a simple framework for
generating music aligned with video inputs. VidMuse stands out by producing
high-fidelity music that is both acoustically and semantically aligned with the
video. By incorporating local and global visual cues, VidMuse enables the
creation of musically coherent audio tracks that consistently match the video
content through Long-Short-Term modeling. Through extensive experiments,
VidMuse outperforms existing models in terms of audio quality, diversity, and
audio-visual alignment. The code and datasets will be available at
https://github.com/ZeyueT/VidMuse/. |
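The abstract's "Long-Short-Term modeling" combines local (short-term) and global (long-term) visual cues per frame. The paper's actual architecture is not reproduced here; the following is only a minimal illustrative sketch of that idea, with the function name, feature representation, and window size all assumed for illustration:

```python
# Hypothetical sketch (not the authors' code): fusing a short-term local
# context with a long-term global context for each video frame, in the
# spirit of the abstract's "Long-Short-Term modeling". The window size
# and list-of-lists feature format are assumptions for illustration.

def long_short_term_features(frame_feats, window=2):
    """For each frame, concatenate a local (short-term) average over a
    sliding window with a global (long-term) average over all frames."""
    n = len(frame_feats)
    dim = len(frame_feats[0])
    # Long-term context: mean feature vector over the whole video.
    global_feat = [sum(f[d] for f in frame_feats) / n for d in range(dim)]
    fused = []
    for i in range(n):
        # Short-term context: mean over a window centered on frame i.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        local = [sum(frame_feats[j][d] for j in range(lo, hi)) / (hi - lo)
                 for d in range(dim)]
        fused.append(local + global_feat)  # concatenation -> 2*dim vector
    return fused

# Three frames with 2-D features; each fused vector is 4-D.
feats = long_short_term_features([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]], window=1)
print(len(feats), len(feats[0]))  # → 3 4
```

In a real system the fused per-frame vectors would condition an audio generator; here the concatenation merely demonstrates how local and global cues can coexist in one representation.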
doi_str_mv | 10.48550/arxiv.2406.04321 |
format | Article |
fullrecord | (raw Primo XML record omitted; key fields it carries: creation date 2024-06-06, rights http://creativecommons.org/licenses/by/4.0, open access free_for_read, source arXiv.org) |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2406.04321 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2406_04321 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning; Computer Science - Multimedia; Computer Science - Sound |
title | VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-04T21%3A24%3A39IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=VidMuse:%20A%20Simple%20Video-to-Music%20Generation%20Framework%20with%20Long-Short-Term%20Modeling&rft.au=Tian,%20Zeyue&rft.date=2024-06-06&rft_id=info:doi/10.48550/arxiv.2406.04321&rft_dat=%3Carxiv_GOX%3E2406_04321%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |