Catch-A-Waveform: Learning to Generate Audio from a Single Short Example
Format: Article
Language: English
Online access: Order full text
Abstract: Models for audio generation are typically trained on hours of recordings.
Here, we illustrate that capturing the essence of an audio source is typically
possible from as little as a few tens of seconds from a single training signal.
Specifically, we present a GAN-based generative model that can be trained on
one short audio signal from any domain (e.g. speech, music, etc.) and does not
require pre-training or any other form of external supervision. Once trained,
our model can generate random samples of arbitrary duration that maintain
semantic similarity to the training waveform, yet exhibit new compositions of
its audio primitives. This enables a wide range of interesting applications,
including generating new jazz improvisations or new a-cappella rap variants
based on a single short example, producing coherent modifications to famous
songs (e.g. adding a new verse to a Beatles song based solely on the original
recording), filling-in of missing parts (inpainting), extending the bandwidth
of a speech signal (super-resolution), and enhancing old recordings without
access to any clean training example. We show that in all cases, no more than
20 seconds of training audio commonly suffice for our model to achieve
state-of-the-art results. This is despite its complete lack of prior knowledge
about the nature of audio signals in general.
DOI: 10.48550/arxiv.2106.06426
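The abstract only summarizes the approach, but its core idea (adversarial training on a single waveform, with a fully convolutional generator so that output duration is unconstrained) can be illustrated in a few lines. The following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: it omits the paper's actual architecture, losses, and training scheme, and all names (`Generator`, `Discriminator`, `random_crop`, `CROP`, `Z_DIM`) are illustrative assumptions.

```python
# Toy sketch: a 1-D GAN trained on random crops of ONE waveform.
# This is NOT the Catch-A-Waveform implementation; it only illustrates
# the single-example adversarial training idea described in the abstract.
import torch
import torch.nn as nn

CROP = 16384   # training crop length in samples (assumed)
Z_DIM = 128    # latent noise channels (assumed)

class Generator(nn.Module):
    """Maps a 1-D noise sequence to a waveform of the same length."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(Z_DIM, 64, kernel_size=9, padding=4), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 64, kernel_size=9, padding=4), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=9, padding=4), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Scores waveform crops; fully convolutional, so length may vary."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=9, stride=2, padding=4), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 64, kernel_size=9, stride=2, padding=4), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=9, padding=4),
        )

    def forward(self, x):
        return self.net(x).mean(dim=(1, 2))  # one scalar score per crop

def random_crop(wave, length=CROP):
    """Sample a random crop from the single training waveform (1, num_samples)."""
    start = torch.randint(0, wave.shape[-1] - length + 1, (1,)).item()
    return wave[..., start:start + length]

def train(wave, steps=2000, lr=1e-4):
    """Hinge-loss GAN training on crops of one waveform (toy schedule)."""
    G, D = Generator(), Discriminator()
    opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.9))
    opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.9))
    for _ in range(steps):
        real = random_crop(wave).unsqueeze(0)        # (1, 1, CROP)
        z = torch.randn(1, Z_DIM, CROP)
        fake = G(z)
        # Discriminator: push real scores up, fake scores down (hinge loss).
        loss_d = (torch.relu(1 - D(real)).mean()
                  + torch.relu(1 + D(fake.detach())).mean())
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()
        # Generator: fool the discriminator.
        loss_g = -D(G(z)).mean()
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()
    return G

# Because G is fully convolutional, feeding a longer noise sequence after
# training yields a waveform of arbitrary duration, as the abstract describes.
```

Applications such as inpainting or bandwidth extension would additionally condition the generator on the observed signal, which this sketch does not attempt.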