Computational framework for fusing eye movements and spoken narratives for image annotation

Bibliographic details
Published in:	Journal of Vision (Charlottesville, Va.), 2020-07, Vol.20 (7), p.13-13, Article 13
Main authors:	Vaidyanathan, Preethi; Prud'hommeaux, Emily; Alm, Cecilia O.; Pelz, Jeff B.
Format:	Article
Language:	English
Subjects:	Adolescent; Adult; Data Curation; Databases, Factual; Eye Movements - physiology; Female; Humans; Life Sciences & Biomedicine; Male; Neural Networks, Computer; Ophthalmology; Science & Technology; Semantics; Speech Perception - physiology; Young Adult
Online access:	Full text

Description
Summary:	Despite many recent advances in the field of computer vision, there remains a disconnect between how computers process images and how humans understand them. To begin to bridge this gap, we propose a framework that integrates human-elicited gaze and spoken language to label perceptually important regions in an image. Our work relies on the notion that gaze and spoken narratives can jointly model how humans inspect and analyze images. Using an unsupervised bitext alignment algorithm originally developed for machine translation, we create meaningful mappings between participants' eye movements over an image and their spoken descriptions of that image. The resulting multimodal alignments are then used to annotate image regions with linguistic labels. The accuracy of these labels exceeds that of baseline alignments obtained using purely temporal correspondence between fixations and words. We also find differences in system performance when identifying image regions using clustering methods that rely on gaze information rather than image features. The alignments produced by our framework can be used to create a database of low-level image features and high-level semantic annotations corresponding to perceptually important image regions. The framework can potentially be applied to any multimodal data stream and to any visual domain. To this end, we provide the research community with access to the computational framework.
ISSN:	1534-7362
DOI:	10.1167/jov.20.7.13
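
Illustrative note: the summary refers to an unsupervised bitext alignment algorithm from machine translation but does not name it. As a rough sketch of the general idea only, the following uses IBM Model 1, a classic alignment model chosen here as an assumption, not necessarily the one used in the article. Each trial is treated as a "sentence pair": a sequence of fixated image-region IDs on one side and the words of the spoken description on the other. The region IDs, words, and function names (`train_ibm1`, `align`) are hypothetical.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=20):
    """Estimate t[(word, region)] = P(word | region) with EM (IBM Model 1).

    pairs: list of (regions, words), one entry per viewing trial.
    """
    vocab = {w for _, words in pairs for w in words}
    # Start from a uniform distribution over the word vocabulary.
    t = defaultdict(lambda: 1.0 / len(vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected (word, region) co-alignments
        total = defaultdict(float)   # expected alignments per region
        for regions, words in pairs:
            for w in words:
                # E-step: spread each word over the regions fixated in its trial.
                norm = sum(t[(w, r)] for r in regions)
                for r in regions:
                    frac = t[(w, r)] / norm
                    count[(w, r)] += frac
                    total[r] += frac
        # M-step: renormalise the expected counts per region.
        t = defaultdict(float, {(w, r): c / total[r] for (w, r), c in count.items()})
    return t

def align(regions, words, t):
    """Label each word with the region most likely to have generated it."""
    return [(w, max(regions, key=lambda r: t[(w, r)])) for w in words]

# Hypothetical toy data: region IDs stand in for clustered fixations.
pairs = [
    (["r_dog", "r_ball"], ["dog", "ball"]),
    (["r_dog"], ["dog"]),
    (["r_ball", "r_grass"], ["ball", "grass"]),
]
t = train_ibm1(pairs)
print(align(["r_ball", "r_grass"], ["ball", "grass"], t))
```

Unlike a purely temporal baseline, which would pair each word with whatever region was fixated at the moment the word was spoken, this kind of model relies on co-occurrence statistics across trials, so it can tolerate lag between looking and speaking.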