mSLAM: Massively multilingual joint pre-training for speech and text

We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages. mSLAM combines w2v-BERT pre-training on speech with SpanBERT pre-training on character-level text, along with Connectionist Temporal Classification (CTC) losses on paired speech and transcript data, to learn a single model capable of learning from and representing both speech and text signals in a shared representation space. We evaluate mSLAM on several downstream speech understanding tasks and find that joint pre-training with text improves quality on speech translation, speech intent classification, and speech language-ID, while remaining competitive on multilingual ASR, compared with speech-only pre-training. Our speech translation model demonstrates zero-shot text translation without seeing any text translation data, providing evidence for cross-modal alignment of representations. mSLAM also benefits from multi-modal fine-tuning, further improving the quality of speech translation by directly leveraging text translation data during the fine-tuning process. Our empirical analysis highlights several opportunities and challenges arising from large-scale multimodal pre-training, suggesting directions for future research.
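The abstract names three training signals: a w2v-BERT-style masked-prediction loss on unlabeled speech, a SpanBERT-style span-masked loss on character-level text, and a CTC loss on paired speech-transcript data, all flowing through one shared encoder. As a rough illustration only, here is a minimal PyTorch sketch of how such a joint objective could be wired up; this is not the authors' implementation, and the toy encoder, all module names, dimensions, and the assumption that masked targets are precomputed are hypothetical.

```python
# Illustrative sketch of a joint speech-text pre-training objective
# (hypothetical stand-in, not the mSLAM implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyJointEncoder(nn.Module):
    def __init__(self, d_model=256, n_chars=100, n_speech_codes=512):
        super().__init__()
        # Modality-specific input layers feeding one SHARED encoder stack.
        self.speech_in = nn.Linear(80, d_model)          # log-mel frames -> d_model
        self.char_emb = nn.Embedding(n_chars, d_model)   # character ids -> d_model
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        self.speech_head = nn.Linear(d_model, n_speech_codes)  # masked-speech targets
        self.char_head = nn.Linear(d_model, n_chars)           # masked-char targets
        self.ctc_head = nn.Linear(d_model, n_chars)            # CTC over chars; id 0 = blank

    def forward(self, speech=None, chars=None):
        h_s = self.shared(self.speech_in(speech)) if speech is not None else None
        h_t = self.shared(self.char_emb(chars)) if chars is not None else None
        return h_s, h_t

def joint_pretrain_loss(model, speech, speech_targets, chars, char_targets,
                        paired_speech, transcripts, transcript_lens):
    # 1) Masked prediction on unlabeled speech (w2v-BERT-style stand-in);
    #    masking/quantization is elided, discrete targets assumed given.
    h_s, _ = model(speech=speech)
    loss_speech = F.cross_entropy(model.speech_head(h_s).transpose(1, 2), speech_targets)
    # 2) Span-masked character LM on unlabeled text (SpanBERT-style stand-in);
    #    positions outside masked spans carry target -100 and are ignored.
    _, h_t = model(chars=chars)
    loss_text = F.cross_entropy(model.char_head(h_t).transpose(1, 2),
                                char_targets, ignore_index=-100)
    # 3) CTC on paired speech and transcripts, nudging both modalities
    #    toward a shared representation space.
    h_p, _ = model(speech=paired_speech)
    log_probs = F.log_softmax(model.ctc_head(h_p), dim=-1).transpose(0, 1)  # (T, B, C)
    in_lens = torch.full((paired_speech.size(0),), paired_speech.size(1), dtype=torch.long)
    loss_ctc = F.ctc_loss(log_probs, transcripts, in_lens, transcript_lens, blank=0)
    return loss_speech + loss_text + loss_ctc

# Example call with toy shapes (batch of 2; transcript ids avoid the blank id 0):
model = ToyJointEncoder()
loss = joint_pretrain_loss(
    model,
    speech=torch.randn(2, 50, 80),
    speech_targets=torch.randint(0, 512, (2, 50)),
    chars=torch.randint(0, 100, (2, 30)),
    char_targets=torch.randint(0, 100, (2, 30)),
    paired_speech=torch.randn(2, 50, 80),
    transcripts=torch.randint(1, 100, (2, 20)),
    transcript_lens=torch.tensor([20, 15]),
)
```

The design point the sketch tries to capture is that both modalities pass through the same shared encoder, while the CTC term supervises the paired data so that speech frames and character sequences are pulled into one representation space.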

Detailed description

Saved in:
Bibliographic details
Main authors: Bapna, Ankur, Cherry, Colin, Zhang, Yu, Jia, Ye, Johnson, Melvin, Cheng, Yong, Khanuja, Simran, Riesa, Jason, Conneau, Alexis
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Learning
Online access: Order full text
creator Bapna, Ankur; Cherry, Colin; Zhang, Yu; Jia, Ye; Johnson, Melvin; Cheng, Yong; Khanuja, Simran; Riesa, Jason; Conneau, Alexis
format Article
identifier DOI: 10.48550/arxiv.2202.01374
language eng
source arXiv.org
subjects Computer Science - Computation and Language; Computer Science - Learning
title mSLAM: Massively multilingual joint pre-training for speech and text
url https://arxiv.org/abs/2202.01374