Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations
Main author: | , , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Abstract: | We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only 40k fine-tuning steps and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketch-like sonic imitations, we propose applying random median filters to the control signals during training, which allows Sketch2Sound to be prompted with controls at flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of the input controls from a vocal imitation while retaining text-prompt adherence and audio quality comparable to a text-only baseline. Sketch2Sound lets sound artists create sounds with the semantic flexibility of text prompts and the expressivity and precision of a sonic gesture or vocal imitation. Sound examples are available at https://hugofloresgarcia.art/sketch2sound/. |
DOI: | 10.48550/arxiv.2412.08550 |
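The abstract's key training trick, smoothing each per-frame control signal with a median filter of randomly chosen width so the model learns to follow both precise and sketch-like controls, can be sketched in plain Python. This is a minimal illustrative sketch: the function names and the set of kernel sizes below are assumptions, not the paper's actual implementation.

```python
import random

def median_filter(signal, kernel_size):
    """Median-filter a 1-D control signal with an odd kernel size.

    Edges are padded by reflection so the output keeps the input length.
    """
    half = kernel_size // 2
    padded = signal[half:0:-1] + signal + signal[-2:-2 - half:-1]
    return [
        sorted(padded[i:i + kernel_size])[half]
        for i in range(len(signal))
    ]

def randomly_filter_control(control, kernel_sizes=(1, 3, 7, 15)):
    """During training, smooth a control signal (e.g. a loudness curve)
    with a randomly drawn median-filter width, so the model can later be
    prompted with controls at flexible levels of temporal specificity.

    The candidate kernel sizes here are illustrative, not from the paper.
    """
    k = random.choice(kernel_sizes)
    return list(control) if k == 1 else median_filter(control, k)
```

A wide kernel wipes out fine temporal detail and keeps only the coarse "gist" of the curve, while kernel size 1 leaves the control untouched; sampling the width at random exposes the model to both regimes.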