A Computational Model of Early Auditory-Visual Integration
Saved in:

Main authors:
Format: Book chapter
Language: English
Subjects:
Online access: Full text
Abstract: We introduce a computational model of sensor fusion based on the topographic representations of a "two-microphone and one camera" configuration. Our aim is to realize a robust multimodal attention mechanism in artificial systems. In our approach, we consider neurophysiological findings to discuss the biological plausibility of the coding and extraction of spatial features, while also meeting the demands and constraints of applications in the field of human-robot interaction. In contrast to the common technique of processing the modalities separately and only finally combining multiple localization hypotheses, we integrate auditory and visual data at an early level. This can be regarded as focusing attention, or directing the gaze, onto salient objects. Our computational model is inspired by findings about the inferior colliculus in the auditory pathway and the visual and multimodal sections of the superior colliculus. Accordingly, it includes: a) an auditory map, based on interaural time delays; b) a visual map, based on spatio-temporal intensity differences; and c) a bimodal map where multisensory response enhancement is performed and motor commands can be derived. In the course of our experiments, questions arise about the spatial and temporal nature of audio-visual information: What precision and shape of receptive fields are suitable for grouping different types of multimodal events? What are useful time windows for multisensory interaction? These questions are rarely discussed in the context of computer vision and sound localization, but seem essential for understanding multimodal perception and designing appropriate models.
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-3-540-45243-0_47
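
The abstract describes a three-map architecture: an auditory map driven by interaural time delays (ITDs), a visual map driven by spatio-temporal intensity differences, and a bimodal map where the two are combined and a gaze target is read out. As a rough illustration of how such maps might interact, here is a minimal NumPy sketch; the map resolution, microphone geometry, cross-correlation ITD estimate, and the multiplicative enhancement rule are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Hypothetical sketch of the three-map architecture summarized in the
# abstract. All names, map sizes, and parameters are illustrative
# assumptions, not values from the paper.

N_AZIMUTH = 61           # assumed resolution: 61 bins over -90..+90 degrees
MIC_DISTANCE = 0.2       # assumed microphone spacing (m)
SPEED_OF_SOUND = 343.0   # m/s
SAMPLE_RATE = 16000      # Hz

def auditory_map(left, right):
    """Auditory map: project interaural time delays onto azimuth bins."""
    azimuths = np.linspace(-np.pi / 2, np.pi / 2, N_AZIMUTH)
    # Far-field ITD expected for each candidate azimuth.
    itds = MIC_DISTANCE * np.sin(azimuths) / SPEED_OF_SOUND
    lags = np.round(itds * SAMPLE_RATE).astype(int)
    xcorr = np.correlate(left, right, mode="full")
    center = len(right) - 1                  # index of zero lag
    act = np.maximum(xcorr[center + lags], 0.0)
    return act / (act.max() + 1e-9)

def visual_map(prev_frame, frame):
    """Visual map: spatio-temporal intensity difference per image column."""
    diff = np.abs(frame - prev_frame).sum(axis=0)   # collapse image rows
    # Resample image columns onto the shared azimuth representation.
    pos = np.linspace(0.0, diff.size - 1, N_AZIMUTH)
    act = np.interp(pos, np.arange(diff.size), diff)
    return act / (act.max() + 1e-9)

def bimodal_map(aud, vis):
    """Bimodal map: multisensory response enhancement. The multiplicative
    cross term (an assumption here) boosts spatially coincident auditory
    and visual activity beyond the sum of the unimodal responses."""
    return aud + vis + aud * vis

# Usage: derive a gaze target (motor command) from the bimodal map.
rng = np.random.default_rng(0)
left = rng.standard_normal(2048)
right = np.roll(left, 4)                     # simulated interaural delay
f0 = rng.random((48, 64))
f1 = f0.copy()
f1[:, 40:48] += 0.5                          # simulated moving stimulus
combined = bimodal_map(auditory_map(left, right), visual_map(f0, f1))
print("most salient azimuth bin:", int(np.argmax(combined)))
```

The additive-plus-cross-term combination is one simple way to mimic superadditive multisensory enhancement: a location active in only one modality keeps roughly its unimodal response, while a location active in both is boosted, so the argmax readout tends to select spatially coincident audio-visual events.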