Visually grounded few-shot word learning in low-resource settings
Saved in:

Main Authors:
Format: Article
Language: eng
Subjects:
Online Access: Order full text
Abstract: We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this few-shot learning problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. Moreover, all previous studies were performed using English speech-image data. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e. fewer shots, and then illustrate how this approach can be applied for multimodal few-shot learning in a real low-resource language, Yorùbá. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark. Many of the model's mistakes are due to confusion between visual concepts co-occurring in similar contexts. The experiments on Yorùbá show the benefit of transferring knowledge from a multimodal model trained on a larger set of English speech-image data.
DOI: 10.48550/arxiv.2306.11371
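The abstract names two components of the approach, mining new word-image training pairs from unlabelled data and a word-to-image attention mechanism for scoring word-image similarity, but gives no implementation details. Below is a minimal sketch in Python/NumPy of what such a scoring and mining step could look like; the cosine-similarity attention, the softmax temperature, the top-k mining rule, and the names `word_to_image_similarity` and `mine_image_pairs` are illustrative assumptions, not taken from the paper, where the embeddings would come from the model's speech and vision encoders.

```python
import numpy as np

def word_to_image_similarity(word_emb, patch_embs, temperature=0.1):
    """Score how well one image matches a spoken query word.

    word_emb:   (d,)   embedding of the query word (e.g. from a speech encoder)
    patch_embs: (n, d) embeddings of n image regions (e.g. from a vision encoder)
    """
    # Cosine similarity between the word and every image region.
    w = word_emb / np.linalg.norm(word_emb)
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    sims = p @ w                                  # shape (n,)

    # Word-to-image attention: the word attends over image regions,
    # and the final score is the attention-weighted region similarity.
    att = np.exp(sims / temperature)
    att /= att.sum()
    return float(att @ sims)

def mine_image_pairs(word_emb, unlabelled_patch_embs, k=5):
    """Mine new (word, image) training pairs from an unlabelled image pool:
    keep the k images that score highest against the few-shot word."""
    scores = [word_to_image_similarity(word_emb, p) for p in unlabelled_patch_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    word = rng.normal(size=64)                             # toy query-word embedding
    pool = [rng.normal(size=(49, 64)) for _ in range(20)]  # 20 toy images, 49 regions each
    print(mine_image_pairs(word, pool, k=3))               # indices of the 3 best-matching images
```

At test time, the same score can be used to pick, from a set of test images, the one that best depicts the spoken query.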