Cross-Lingual Visual Grounding

Bibliographic details
Published in: IEEE Access 2021, Vol. 9, pp. 349-358
Main authors: Dong, Wenjian; Otani, Mayu; Garcia, Noa; Nakashima, Yuta; Chu, Chenhui
Format: Article
Language: English
Description
Summary: Visual grounding is a vision-and-language understanding task that aims to locate the region of an image described by a specific query phrase. However, most previous studies address this task only for English. Previous cross-lingual vision-and-language studies exist, but they focus on image and video captioning and on visual question answering. In this paper, we present the first work on cross-lingual visual grounding, extending the task to other languages in order to study an effective yet efficient way to perform visual grounding in them. We construct a visual grounding dataset for French via crowdsourcing. Our dataset consists of 14k, 3k, and 3k query phrases with their corresponding image regions for 5k, 1k, and 1k training, validation, and test images, respectively. In addition, we propose a cross-lingual visual grounding approach that transfers knowledge from a learned English model to a French model. Although our French dataset is 1/6 the size of the English dataset, experiments indicate that our model achieves an accuracy of 65.17%, which is comparable to the 69.04% accuracy of the English model. Our dataset and code are available at https://github.com/ids-cv/Multi-Lingual-Visual-Grounding .
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.3046719