Zero-resource Speech Translation and Recognition with LLMs

Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations into the token embedding space of the LLM. We perform several experiments in both ST and ASR to understand how best to train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model achieves BLEU scores above 23 on CoVoST2 for two previously unseen languages, while in ASR we achieve WERs of up to 28.2%. We finally show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
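
The abstract describes a bridge architecture: a frozen-style pre-trained speech encoder, a lightweight adaptation module, and a multilingual LLM that consumes the adapted audio frames as if they were token embeddings. Below is a minimal PyTorch sketch of that idea; the stand-in encoder, the dimensions, and the adapter design (strided convolution plus projection) are illustrative assumptions, not the authors' exact components.

```python
# Minimal sketch of the encoder -> adapter -> LLM pipeline from the abstract.
# All module choices and sizes here are hypothetical placeholders.
import torch
import torch.nn as nn

class AudioToLLMAdapter(nn.Module):
    """Lightweight adaptation module: maps speech-encoder frames into the
    LLM's token embedding space. The paper only calls the module
    'lightweight'; this conv+projection design is an assumption."""
    def __init__(self, audio_dim: int, llm_dim: int, stride: int = 4):
        super().__init__()
        # Downsample in time so the audio sequence length is closer to text.
        self.downsample = nn.Conv1d(audio_dim, llm_dim,
                                    kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim)
        x = self.downsample(audio_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)  # (batch, frames // stride, llm_dim)

# Stand-ins for the pre-trained components (illustrative sizes).
audio_dim, llm_dim, vocab = 1024, 2048, 32000
speech_encoder = nn.GRU(80, audio_dim, batch_first=True)  # placeholder encoder
llm_embeddings = nn.Embedding(vocab, llm_dim)             # LLM token embeddings

adapter = AudioToLLMAdapter(audio_dim, llm_dim)

# 3 s of 80-dim filterbank features for one utterance.
fbank = torch.randn(1, 300, 80)
audio_feats, _ = speech_encoder(fbank)
audio_tokens = adapter(audio_feats)  # "soft tokens" in the LLM's space

# Prepend a text task prompt (e.g. "Translate to German:") and feed the
# concatenated sequence to the LLM in place of ordinary text embeddings.
prompt_ids = torch.randint(0, vocab, (1, 8))
llm_input = torch.cat([llm_embeddings(prompt_ids), audio_tokens], dim=1)
print(llm_input.shape)  # (1, 8 + 75, 2048)
```

A common way to realize such a design is to train only the adapter while keeping the encoder and LLM frozen, so that the LLM's multilingual text ability carries over to languages without paired audio-text data; the abstract does not specify which components the authors freeze.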

Bibliographic Details
Authors: Mundnich, Karel; Niu, Xing; Mathur, Prashant; Ronanki, Srikanth; Houston, Brady; Elluru, Veera Raghavendra; Das, Nilaksh; Hou, Zejiang; Huybrechts, Goeric; Bhatia, Anshu; Garcia-Romero, Daniel; Han, Kyu J; Kirchhoff, Katrin
Format: Article
Language: English
Subjects: Computer Science - Computation and Language
DOI: 10.48550/arxiv.2412.18566
Date: 2024-12-24
Source: arXiv.org
Rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0 (open access)
Online access: Full text at https://arxiv.org/abs/2412.18566