Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
Format: Article
Language: English
Abstract: Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly chosen, explicitly labeled examples, human acquisition relies more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate categories into relevant semantic classes. By training a combined sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the number of labels required to reach a desired classification performance.
DOI: 10.48550/arxiv.1911.05894
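
The abstract describes three training signals: a coincidence objective, a clustering objective, and cluster-based active learning. The sketch below illustrates one plausible form each could take. It is a minimal illustration only, not the authors' implementation: the framework choice (PyTorch), the network shapes, the specific loss forms (coincidence as a logistic loss on paired embeddings, clustering as an entropy-based surrogate, active learning as lowest-mean-confidence cluster selection), and all names and hyperparameters are assumptions.

```python
# Minimal sketch of the three objectives; every architecture choice, loss
# form, and hyperparameter here is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoundNet(nn.Module):
    """Shared embedding with cluster-assignment and classification heads."""
    def __init__(self, feat_dim=64, emb_dim=128, n_clusters=512, n_classes=527):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))
        self.cluster_head = nn.Linear(emb_dim, n_clusters)  # (ii) categories
        self.class_head = nn.Linear(emb_dim, n_classes)     # (iii) semantic labels

    def forward(self, x):
        z = self.encoder(x)
        return z, self.cluster_head(z), self.class_head(z)

def coincidence_loss(z_a, z_b, coincide):
    """(i) Predict whether two inputs (unimodal or cross-modal) coincide,
    here via a dot-product score and a logistic loss."""
    score = (z_a * z_b).sum(dim=-1)
    return F.binary_cross_entropy_with_logits(score, coincide)

def clustering_loss(cluster_logits):
    """(ii) One common surrogate: sharp per-example assignments (low entropy)
    while spreading usage across clusters (high marginal entropy)."""
    p = cluster_logits.softmax(dim=-1)
    sharp = -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()
    marginal = p.mean(dim=0)
    balance = (marginal * marginal.clamp_min(1e-8).log()).sum()
    return sharp + balance

def clusters_to_label(cluster_logits, k=5):
    """(iii) Cluster-based active learning: request weak labels for the k
    clusters whose members the model assigns least confidently."""
    p = cluster_logits.softmax(dim=-1)
    conf, assign = p.max(dim=-1)
    mean_conf = torch.full((p.shape[-1],), float("inf"))
    for c in assign.unique():
        mean_conf[c] = conf[assign == c].mean()
    return mean_conf.argsort()[:k]

# Toy usage: a batch of "coinciding-or-not" pairs and the combined objective.
model = SoundNet()
x_a, x_b = torch.randn(32, 64), torch.randn(32, 64)
coincide = (torch.rand(32) > 0.5).float()
z_a, clust_a, _ = model(x_a)
z_b, _, _ = model(x_b)
loss = coincidence_loss(z_a, z_b, coincide) + 0.1 * clustering_loss(clust_a)
loss.backward()
print(clusters_to_label(clust_a.detach(), k=3))
```

Per the abstract, the weak labels solicited for the selected clusters would then train the classification head into semantic classes; that supervised consolidation step is omitted from this sketch.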