Generative Model Driven Representation Learning in a Hybrid Framework for Environmental Audio Scene and Sound Event Recognition

Bibliographic details
Published in: IEEE Transactions on Multimedia, 2020-01, Vol. 22 (1), p. 3-14
Main authors: Chandrakala, S., Jayalakshmi, S. L.
Format: Article
Language: English
Description
Summary: The analysis of sound information is helpful for audio surveillance, multimedia information retrieval, audio tagging, and forensic applications. Environmental audio scene recognition (EASR) and sound event recognition (SER) for audio surveillance are challenging tasks due to the presence of multiple sound sources, background noise, and overlapping or polyphonic contexts. We focus on learning robust and compact representations for environmental audio scenes and sound events using mel-frequency cepstral coefficients as basic features, which have proved effective in speech and audio-related tasks. In this paper, we propose a common hybrid model-based framework that learns representations with the help of generative models. We explore instance-specific adapted Gaussian mixture models for environmental audio scenes and instance-specific hidden Markov models for sound events to compute robust, compact, and discriminative representations. A discriminative model-based classifier is then used to recognize these representations as environmental audio scenes and sound events. The performance of the proposed approaches is evaluated on the DCASE2013 scene dataset and the TUT-DCASE2016 scene dataset for the EASR task. The Environmental Sound Classification (ESC-10) and UrbanSound8K datasets are used for the SER task. The recognition accuracy of the proposed framework is significantly better than that of many state-of-the-art approaches proposed in the recent literature. The discriminative nature of the model-driven representations leads to improved efficiency for the EASR and SER tasks. The proposed approaches are more suitable for tasks with less training data.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2019.2925956
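
The summary describes turning each recording into a fixed-length representation by adapting a Gaussian mixture model to that instance. The abstract does not say how the adaptation is done, so the following is only a minimal sketch under stated assumptions: it uses MAP adaptation of the component means of a shared diagonal-covariance background GMM, and concatenates the adapted means into one vector. The function names, the `relevance` parameter, and the toy model values are hypothetical and not taken from the paper. The frames stand in for MFCC vectors that are assumed to have been extracted already; the discriminative classifier that would consume the vector is not shown.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Responsibility of each mixture component for frame x."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_means(frames, weights, means, variances, relevance=16.0):
    """MAP-adapt the component means to one clip (assumed scheme).

    Returns the adapted means flattened into a single vector of
    length (number of components) * (feature dimension).
    """
    k, d = len(means), len(means[0])
    counts = [0.0] * k
    sums = [[0.0] * d for _ in range(k)]
    for x in frames:
        for c, g in enumerate(posteriors(x, weights, means, variances)):
            counts[c] += g
            for j in range(d):
                sums[c][j] += g * x[j]
    vector = []
    for c in range(k):
        # alpha grows with the amount of data assigned to component c
        alpha = counts[c] / (counts[c] + relevance)
        for j in range(d):
            clip_mean = sums[c][j] / counts[c] if counts[c] > 0 else means[c][j]
            vector.append(alpha * clip_mean + (1.0 - alpha) * means[c][j])
    return vector


# Toy background model: 2 components over 2-dimensional features.
bg_weights = [0.5, 0.5]
bg_means = [[0.0, 0.0], [10.0, 10.0]]
bg_vars = [[1.0, 1.0], [1.0, 1.0]]

clip_frames = [[1.0, 1.0]] * 16  # stand-in for one clip's MFCC frames
representation = adapt_means(clip_frames, bg_weights, bg_means, bg_vars)
```

Because every clip is mapped to a vector of the same length regardless of its duration, the resulting vectors can be passed to any standard discriminative classifier. Components that receive little data stay close to the background means, which is one reason this kind of adaptation is often used when training data is scarce.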