Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings
Main authors: | , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Abstract: | Music generation introduces challenging complexities to large language models. Symbolic structures of music often include vertical harmonization as well as horizontal counterpoint, calling for various adaptations and enhancements to large-scale Transformers. However, existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are typically missing in raw MIDI data; 2) the pure impact of enhanced token embedding methods is rarely examined in the absence of such annotations; and 3) existing works that address these drawbacks, such as MuseNet, lack reproducibility. To tackle these limitations, we develop a MIDI-based music generation framework inspired by MuseNet, empirically studying two structural embeddings that do not rely on domain-specific annotations. We provide various metrics and insights that can guide the choice of a suitable encoding to deploy. We also verify that multiple embedding configurations can selectively boost certain musical aspects. By providing open-source implementations via HuggingFace, our findings shed light on leveraging large language models toward practical and reproducible music generation. |
DOI: | 10.48550/arxiv.2407.19900 |
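The abstract describes adding structural embeddings on top of ordinary token embeddings, using only information available in raw MIDI (no bar or beat annotations). The sketch below illustrates that general idea, not the authors' released code: it sums token, position, and a structural embedding, where the structural signal is assumed here to be a quantized inter-onset time delta. All class, parameter, and variable names are hypothetical.

```python
# Minimal sketch (not the paper's implementation): token embeddings augmented
# with a structural embedding computable from raw MIDI, without bar/beat labels.
import torch
import torch.nn as nn

class StructuralEmbedding(nn.Module):
    """Sums token, positional, and structural (e.g. time-delta) embeddings."""
    def __init__(self, vocab_size=512, n_struct_bins=128, d_model=512, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        # Structural signal: inter-onset time deltas quantized into bins,
        # derivable directly from raw MIDI event times.
        self.struct = nn.Embedding(n_struct_bins, d_model)

    def forward(self, token_ids, struct_ids):
        # token_ids, struct_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.struct(struct_ids)

# Usage: the resulting embeddings can be fed to any decoder-only Transformer,
# e.g. via the inputs_embeds argument of a GPT-2-style HuggingFace model.
emb = StructuralEmbedding()
tokens = torch.randint(0, 512, (1, 16))   # dummy event tokens
deltas = torch.randint(0, 128, (1, 16))   # dummy quantized time deltas
x = emb(tokens, deltas)                   # shape: (1, 16, 512)
```

Summing the structural embedding into the input keeps the Transformer architecture unchanged, which is one common way to test the "pure impact" of an embedding scheme in isolation.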