Rethinking CNN Models for Audio Classification
Main authors: | , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Summary: | In this paper, we show that standard deep CNN models pretrained on ImageNet
can be used as strong baseline networks for audio classification. Even though
audio spectrograms differ significantly from standard ImageNet image samples,
transfer learning assumptions still hold firmly. To understand what enables
the ImageNet-pretrained models to learn useful audio representations, we
systematically study how much of the pretrained weights is useful for learning
spectrograms. We show (1) that, for a given standard model, using pretrained
weights is better than using randomly initialized weights, and (2) qualitative
results of what the CNNs learn from the spectrograms by visualizing the
gradients. In addition, we show that even when the pretrained weights are used
for initialization, performance varies across runs of the same model. This
variance is due to the random initialization of the linear classification
layer and the random mini-batch ordering in each run. This brings significant
diversity, which can be used to build stronger ensemble models with an overall
improvement in accuracy. An ensemble of ImageNet-pretrained DenseNet models
achieves 92.89% validation accuracy on the ESC-50 dataset and 87.42%
validation accuracy on the UrbanSound8K dataset, which is the current state of
the art on both datasets. |
---|---|
DOI: | 10.48550/arxiv.2007.11154 |
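
The summary notes that runs of the same pretrained model differ because of the randomly initialized classification layer and the mini-batch ordering, and that this diversity helps an ensemble. The summary does not say how the runs are combined, so the sketch below assumes the simplest rule, averaging each run's softmax probabilities and taking the most likely class. The function names and the logits are hypothetical and serve only as illustration; they are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_run_logits):
    """Average softmax outputs over runs and return (label, mean_probs).

    per_run_logits: one list of class logits per training run, all for the
    same audio clip. Averaging is an assumed combination rule.
    """
    probs = [softmax(logits) for logits in per_run_logits]
    n_classes = len(probs[0])
    mean_probs = [
        sum(p[c] for p in probs) / len(probs) for c in range(n_classes)
    ]
    label = max(range(n_classes), key=mean_probs.__getitem__)
    return label, mean_probs


# Hypothetical logits from three runs of the same model on one clip.
# The second run alone would pick class 1; the averaged ensemble picks class 0.
runs = [
    [2.0, 1.0, 0.1],
    [0.5, 1.5, 0.2],
    [1.8, 1.7, 0.0],
]
label, mean_probs = ensemble_predict(runs)
print(label, [round(p, 3) for p in mean_probs])
```

Averaging probabilities rather than hard votes lets a confident run outweigh an uncertain one, which is one plausible way to exploit the run-to-run variance described above.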