Translating speech with just images

Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components in order to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.
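
The abstract describes a pivot pipeline: spoken Yorùbá utterances are paired with images, an existing captioning system turns those images into English text, and the resulting (audio, caption) pairs train a direct speech-to-text model. Below is a minimal, hypothetical Python sketch of that data-building step; every name in it (build_pairs, toy_captioner, the file paths) is illustrative, not the authors' implementation:

    # Hypothetical sketch of the image-as-pivot training-data pipeline
    # described in the abstract. All names are illustrative assumptions;
    # no specific models or APIs from the paper are implied.
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class TrainingPair:
        audio_path: str  # Yoruba speech clip paired with an image
        caption: str     # English caption generated from that image

    def build_pairs(
        items: List[Tuple[str, str]],                 # (audio_path, image_path)
        caption_fn: Callable[[str, int], List[str]],  # image -> n diverse captions
        n_captions: int = 5,                          # several captions per image
    ) -> List[TrainingPair]:
        """Pair each utterance with several diverse captions of its image,
        so the speech-to-text model is trained without any transcribed text
        in the source language."""
        pairs = []
        for audio_path, image_path in items:
            for caption in caption_fn(image_path, n_captions):
                pairs.append(TrainingPair(audio_path, caption))
        return pairs

    # Toy stand-in captioner: a real system would decode diverse captions
    # (e.g. by sampling) from a pretrained image captioning model.
    def toy_captioner(image_path: str, n: int) -> List[str]:
        return [f"caption {i} for {image_path}" for i in range(n)]

    pairs = build_pairs([("clip0.wav", "img0.jpg")], toy_captioner)
    print(pairs[0])

Generating several diverse captions per image (the n_captions loop) mirrors the paper's finding that a diverse caption decoding scheme is essential to limit overfitting.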

Bibliographic details
Main authors: Oneata, Dan; Kamper, Herman
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Sound
Online access: https://arxiv.org/abs/2406.07133
DOI: 10.48550/arxiv.2406.07133
Date: 2024-06-11
Source: arXiv.org