VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Saved in:

Main Authors: , , , , , , , ,
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 360K video-music pairs, including genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Extensive experiments show that VidMuse outperforms existing models in audio quality, diversity, and audio-visual alignment. The code and datasets will be available at https://github.com/ZeyueT/VidMuse/.
DOI: 10.48550/arxiv.2406.04321
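
The abstract's central architectural idea is conditioning music generation on both local (short-term) and global (long-term) visual cues. Purely as illustration, the sketch below shows one way such long-short-term fusion could be wired up in PyTorch: local window tokens cross-attend to a pooled whole-video summary. All names, dimensions, and the mean-pooled global context are assumptions for this sketch, not VidMuse's published design; see the repository linked above for the actual implementation.

```python
import torch
import torch.nn as nn

class LongShortTermConditioner(nn.Module):
    """Fuse short-term (local window) and long-term (whole-video) visual
    cues into conditioning tokens for a music decoder. Illustrative only."""

    def __init__(self, feat_dim=768, hidden_dim=512, num_heads=8):
        super().__init__()
        self.local_proj = nn.Linear(feat_dim, hidden_dim)   # short-term branch
        self.global_proj = nn.Linear(feat_dim, hidden_dim)  # long-term branch
        # local tokens attend to the global summary (cross-attention fusion)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, frame_feats, start, window=16):
        # frame_feats: (B, T, feat_dim) per-frame features from a visual encoder
        local = self.local_proj(frame_feats[:, start:start + window])  # (B, W, H)
        # Coarse whole-video summary via mean pooling (an assumption here)
        global_ctx = self.global_proj(frame_feats).mean(dim=1, keepdim=True)  # (B, 1, H)
        fused, _ = self.cross_attn(query=local, key=global_ctx, value=global_ctx)
        return self.norm(local + fused)  # (B, W, H) conditioning tokens

# Usage: derive conditioning for the music segment aligned with frames 32-47.
feats = torch.randn(2, 128, 768)  # 2 videos, 128 frames of CLIP-like features
cond = LongShortTermConditioner()(feats, start=32)
print(cond.shape)  # torch.Size([2, 16, 512])
```

The fused tokens would then serve as the conditioning sequence for an autoregressive music-token decoder, so each generated segment reflects nearby frames while staying consistent with the video as a whole.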