Blind Audiovisual Source Separation Based on Sparse Redundant Representations
| Published in: | IEEE Transactions on Multimedia, 2010-08, Vol. 12 (5), p. 358-371 |
|---|---|
| Main authors: | , , , |
| Format: | Article |
| Language: | English |
| Abstract: | In this paper, we propose a novel method that detects and separates the audiovisual sources present in a scene. The method exploits the correlation between the video signal captured by a camera and a synchronously recorded single-microphone audio track. In a first stage, the audio and video modalities are decomposed into relevant basic structures using redundant representations. Next, the synchrony between relevant events in the audio and video modalities is quantified. Based on this co-occurrence measure, audiovisual sources are counted and localized in the image using a robust clustering algorithm that groups video structures exhibiting strong correlations with the audio. Then, periods during which each source is active alone are determined and used to build spectral Gaussian mixture models (GMMs) characterizing each source's acoustic behavior. Finally, these models are used to separate the audio signal during periods in which several sources are mixed. The proposed approach has been extensively tested on synthetic and natural sequences composed of speakers and musical instruments. Results show that the method successfully detects, localizes, separates, and reconstructs the audiovisual sources present. |
ISSN: | 1520-9210 1941-0077 |
DOI: | 10.1109/TMM.2010.2050650 |
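The final separation stage described in the abstract builds per-source spectral models from solo periods and uses them to unmix frames where sources overlap. Below is a minimal, hypothetical sketch of that idea using simple time-invariant spectral templates and Wiener-style soft masks; the paper itself uses full spectral GMMs, and all variable names here (`template_a`, `mask_a`, etc.) are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-source spectral templates, standing in for the GMM
# component means the paper learns from periods where each source
# plays alone.
n_freq = 8
template_a = np.abs(rng.normal(size=n_freq)) + 0.1  # source A spectrum
template_b = np.abs(rng.normal(size=n_freq)) + 0.1  # source B spectrum

# Mixture magnitude spectrogram: a few frames where A and B overlap,
# each with a random per-frame gain.
n_frames = 5
gain_a = rng.uniform(0.5, 1.5, size=n_frames)
gain_b = rng.uniform(0.5, 1.5, size=n_frames)
mix = np.outer(template_a, gain_a) + np.outer(template_b, gain_b)

# Wiener-style soft masks from the template powers (simplified:
# time-invariant estimates instead of per-frame GMM posteriors).
power_a = template_a[:, None] ** 2
power_b = template_b[:, None] ** 2
mask_a = power_a / (power_a + power_b)
mask_b = power_b / (power_a + power_b)

# Apply the masks to the mixture to estimate each source.
est_a = mask_a * mix
est_b = mask_b * mix
```

Because the two masks sum to one in every time-frequency bin, the source estimates always add back up to the observed mixture, which is the basic consistency property of this family of masking methods.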