Translating speech with just images

Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components in order to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.
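
The abstract describes a pivot pipeline: spoken Yorùbá utterances are paired with images, an existing captioning system turns those images into English text, and the resulting (audio, caption) pairs train a direct speech-to-text model. Below is a minimal, hypothetical Python sketch of that data-building step; every name in it (build_pairs, toy_captioner, the file paths) is illustrative, not the authors' implementation:

    # Hypothetical sketch of the image-as-pivot training-data pipeline
    # described in the abstract. All names are illustrative assumptions;
    # no specific models or APIs from the paper are implied.
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class TrainingPair:
        audio_path: str  # Yoruba speech clip paired with an image
        caption: str     # English caption generated from that image

    def build_pairs(
        items: List[Tuple[str, str]],                 # (audio_path, image_path)
        caption_fn: Callable[[str, int], List[str]],  # image -> n diverse captions
        n_captions: int = 5,                          # several captions per image
    ) -> List[TrainingPair]:
        """Pair each utterance with several diverse captions of its image,
        so the speech-to-text model is trained without any transcribed text
        in the source language."""
        pairs = []
        for audio_path, image_path in items:
            for caption in caption_fn(image_path, n_captions):
                pairs.append(TrainingPair(audio_path, caption))
        return pairs

    # Toy stand-in captioner: a real system would decode diverse captions
    # (e.g. by sampling) from a pretrained image captioning model.
    def toy_captioner(image_path: str, n: int) -> List[str]:
        return [f"caption {i} for {image_path}" for i in range(n)]

    pairs = build_pairs([("clip0.wav", "img0.jpg")], toy_captioner)
    print(pairs[0])

Generating several diverse captions per image (the n_captions loop) mirrors the paper's finding that a diverse caption decoding scheme is essential to limit overfitting.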

Bibliographic details
Main authors: Oneata, Dan; Kamper, Herman
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Sound
Online access: https://arxiv.org/abs/2406.07133
DOI: 10.48550/arxiv.2406.07133
Date: 2024-06-11
Source: arXiv.org