Translating speech with just images
Saved in:
Main authors: | , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components in order to learn in the low-resource regime. To limit overfitting, we find it essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form. |
DOI: | 10.48550/arxiv.2406.07133 |
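
The summary describes pairing each spoken utterance with several diverse machine-generated captions of the same image, so that the speech translation model is trained on (audio, English text) pairs without any parallel Yorùbá-English text. The following minimal Python sketch illustrates that data-construction step under stated assumptions; the function and field names are hypothetical placeholders, not the paper's actual code.

```python
from dataclasses import dataclass

@dataclass
class TrainingPair:
    audio_path: str  # Yorùbá utterance describing an image
    caption: str     # English caption generated for the same image

def generate_diverse_captions(image_path, num_captions=5):
    """Stub for a pretrained image captioner. A real system would use a
    sampling-based decoding scheme here (rather than greedy decoding),
    since the summary reports that diverse captions are essential to
    limit overfitting."""
    # Placeholder output; substitute calls to an actual captioning model.
    return [f"sampled caption {i} for {image_path}" for i in range(num_captions)]

def build_training_pairs(images_with_audio):
    """Pair each Yorùbá recording with every caption sampled for its
    image, yielding (speech, English text) supervision without any
    parallel Yorùbá-English text."""
    return [
        TrainingPair(audio_path, caption)
        for image_path, audio_path in images_with_audio
        for caption in generate_diverse_captions(image_path)
    ]
```

With one utterance per image and five sampled captions, each recording contributes five training pairs; the caption diversity, rather than the pair count itself, is what the summary identifies as the key to limiting overfitting.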