Pre-training with Synthetic Patterns for Audio
Format: Article
Language: English
Online access: Order full text
Abstract: In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first is the Masked Autoencoder (MAE), a self-supervised learning framework that learns by reconstructing data from randomly masked counterparts. MAEs tend to focus on low-level information such as visual patterns and regularities within the data, so what the input depicts matters little, whether it is images, audio mel-spectrograms, or even synthetic patterns. This leads to the second key element: synthetic data. Synthetic data, unlike real audio, is free from privacy and licensing issues. By combining MAEs with synthetic patterns, our framework enables the model to learn generalized feature representations without real data while sidestepping the problems that real audio raises. To evaluate the efficacy of our framework, we conduct extensive experiments across a total of 13 audio tasks and 17 synthetic datasets. The experiments provide insights into which types of synthetic patterns are effective for audio. Our results demonstrate that our framework achieves performance comparable to models pre-trained on AudioSet-2M and, in some settings, outperforms image-based pre-training methods.
DOI: 10.48550/arxiv.2410.00511
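
The MAE recipe the abstract describes, masking a large fraction of input patches and reconstructing only the masked ones, can be sketched in a few lines. Below is a minimal, illustrative sketch assuming a ViT-style patch MAE applied to a synthetic sinusoidal pattern standing in for a mel-spectrogram; the class name `TinyMAE`, the `mask_ratio`, the toy mean-pooled decoder, and the pattern generator are all assumptions for illustration, not the paper's actual architecture or synthetic-pattern generators.

```python
# Illustrative MAE-style masking and reconstruction on a synthetic pattern.
# All names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn

def synthetic_pattern(h=64, w=64):
    # Cheap stand-in for a synthetic pattern: a product of sinusoids.
    ys = torch.linspace(0, 8 * torch.pi, h).unsqueeze(1)
    xs = torch.linspace(0, 8 * torch.pi, w).unsqueeze(0)
    return (torch.sin(ys) * torch.cos(xs)).unsqueeze(0)  # (1, H, W)

class TinyMAE(nn.Module):
    def __init__(self, patch=8, dim=64, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.embed = nn.Linear(patch * patch, dim)        # patch -> token
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)
        self.decoder = nn.Linear(dim, patch * patch)      # token -> pixels

    def patchify(self, x):                                # x: (B, 1, H, W)
        p = self.patch
        return x.unfold(2, p, p).unfold(3, p, p).reshape(x.size(0), -1, p * p)

    def forward(self, x):
        patches = self.patchify(x)                        # (B, N, p*p)
        B, N, _ = patches.shape
        keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N).argsort(dim=1)            # random mask per sample
        vis_idx, msk_idx = perm[:, :keep], perm[:, keep:]
        visible = torch.gather(
            patches, 1, vis_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        latent = self.encoder(self.embed(visible))        # encode visible tokens only
        # Toy decoder: predict every masked patch from the pooled latent.
        pred = self.decoder(latent.mean(dim=1, keepdim=True)).expand(
            -1, N - keep, -1)
        target = torch.gather(
            patches, 1, msk_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        return nn.functional.mse_loss(pred, target)       # loss on masked patches only

x = synthetic_pattern().unsqueeze(0)                      # (1, 1, 64, 64)
loss = TinyMAE()(x)
print(f"masked-reconstruction loss: {loss.item():.4f}")
```

The 75% mask ratio follows the common MAE default; the key property the abstract relies on is that the encoder sees only visible patches and the loss is computed only on masked ones, so the objective rewards modeling low-level regularities in the input, regardless of whether that input is a real recording or a synthetic pattern.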