USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder

Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language mod...

Detailed description

Bibliographic details
Published in:	arXiv.org 2022-02
Main authors:	Bolaji Yusuf; Gandhe, Ankur; Sokolov, Alex
Format:	Article
Language:	eng
Keywords:	Coders; Encoders-Decoders; Inference; Machine translation; Speech recognition; Training
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Bolaji Yusuf ; Gandhe, Ankur ; Sokolov, Alex
description	Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training an ASR model jointly with a set of text-to-text auxiliary tasks with which it shares a decoder and parts of the encoder. When we jointly train ASR on the 960-hour Librispeech data and a masked language model on Opensubtitles data, we observe WER reductions of 16% and 20% on test-other and test-clean respectively over an ASR-only baseline, without any extra cost at inference time, and reductions of 6% and 8% compared to a stronger MUTE-L baseline, which trains the decoder with the same text data as our model. We achieve further improvements when we train the masked language model on Librispeech data or when we use machine translation as the auxiliary task, without significantly sacrificing performance on the task itself.
format	Article
fullrecord	<record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2628910973</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2628910973</sourcerecordid><originalsourceid>FETCH-proquest_journals_26289109733</originalsourceid><addsrcrecordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mRwDg0OcXWxUvDMLSjKL8vMS1dwDA5SKM8syVBIVAjNy0zLTE1RCC5ITU0GCuSlKISkVpQouOYl56ekFum6pIJpHgbWtMSc4lReKM3NoOzmGuLsoQs0srA0tbgkPiu_tCgPKBVvZGZkYWloYGlubEycKgD-Mjha</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2628910973</pqid></control><display><type>article</type><title>USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder</title><source>Free E- Journals</source><creator>Bolaji Yusuf ; Gandhe, Ankur ; Sokolov, Alex</creator><creatorcontrib>Bolaji Yusuf ; Gandhe, Ankur ; Sokolov, Alex</creatorcontrib><description>Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training ASR model jointly with a set of text-to-text auxiliary tasks with which it shares a decoder and parts of the encoder. When we jointly train ASR and masked language model with the 960-hour Librispeech and Opensubtitles data respectively, we observe WER reductions of 16% and 20% on test-other and test-clean respectively over an ASR-only baseline without any extra cost at inference time, and reductions of 6% and 8% compared to a stronger MUTE-L baseline which trains the decoder with the same text data as our model. 
We achieve further improvements when we train masked language model on Librispeech data or when we use machine translation as the auxiliary task, without significantly sacrificing performance on the task itself.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Coders ; Encoders-Decoders ; Inference ; Machine translation ; Speech recognition ; Training</subject><ispartof>arXiv.org, 2022-02</ispartof><rights>2022. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>777,781</link.rule.ids></links><search><creatorcontrib>Bolaji Yusuf</creatorcontrib><creatorcontrib>Gandhe, Ankur</creatorcontrib><creatorcontrib>Sokolov, Alex</creatorcontrib><title>USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder</title><title>arXiv.org</title><description>Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training ASR model jointly with a set of text-to-text auxiliary tasks with which it shares a decoder and parts of the encoder. 
When we jointly train ASR and masked language model with the 960-hour Librispeech and Opensubtitles data respectively, we observe WER reductions of 16% and 20% on test-other and test-clean respectively over an ASR-only baseline without any extra cost at inference time, and reductions of 6% and 8% compared to a stronger MUTE-L baseline which trains the decoder with the same text data as our model. We achieve further improvements when we train masked language model on Librispeech data or when we use machine translation as the auxiliary task, without significantly sacrificing performance on the task itself.</description><subject>Coders</subject><subject>Encoders-Decoders</subject><subject>Inference</subject><subject>Machine translation</subject><subject>Speech recognition</subject><subject>Training</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNpjYuA0MjY21LUwMTLiYOAtLs4yMDAwMjM3MjU15mRwDg0OcXWxUvDMLSjKL8vMS1dwDA5SKM8syVBIVAjNy0zLTE1RCC5ITU0GCuSlKISkVpQouOYl56ekFum6pIJpHgbWtMSc4lReKM3NoOzmGuLsoQs0srA0tbgkPiu_tCgPKBVvZGZkYWloYGlubEycKgD-Mjha</recordid><startdate>20220212</startdate><enddate>20220212</enddate><creator>Bolaji Yusuf</creator><creator>Gandhe, Ankur</creator><creator>Sokolov, Alex</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20220212</creationdate><title>USTED: Improving ASR with 
a Unified Speech and Text Encoder-Decoder</title><author>Bolaji Yusuf ; Gandhe, Ankur ; Sokolov, Alex</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_26289109733</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Coders</topic><topic>Encoders-Decoders</topic><topic>Inference</topic><topic>Machine translation</topic><topic>Speech recognition</topic><topic>Training</topic><toplevel>online_resources</toplevel><creatorcontrib>Bolaji Yusuf</creatorcontrib><creatorcontrib>Gandhe, Ankur</creatorcontrib><creatorcontrib>Sokolov, Alex</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Bolaji Yusuf</au><au>Gandhe, Ankur</au><au>Sokolov, Alex</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>USTED: Improving ASR with a Unified Speech and Text 
Encoder-Decoder</atitle><jtitle>arXiv.org</jtitle><date>2022-02-12</date><risdate>2022</risdate><eissn>2331-8422</eissn><abstract>Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training ASR model jointly with a set of text-to-text auxiliary tasks with which it shares a decoder and parts of the encoder. When we jointly train ASR and masked language model with the 960-hour Librispeech and Opensubtitles data respectively, we observe WER reductions of 16% and 20% on test-other and test-clean respectively over an ASR-only baseline without any extra cost at inference time, and reductions of 6% and 8% compared to a stronger MUTE-L baseline which trains the decoder with the same text data as our model. We achieve further improvements when we train masked language model on Librispeech data or when we use machine translation as the auxiliary task, without significantly sacrificing performance on the task itself.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2022-02
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_2628910973
source	Free E- Journals
subjects	Coders ; Encoders-Decoders ; Inference ; Machine translation ; Speech recognition ; Training
title	USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T05%3A50%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=USTED:%20Improving%20ASR%20with%20a%20Unified%20Speech%20and%20Text%20Encoder-Decoder&rft.jtitle=arXiv.org&rft.au=Bolaji%20Yusuf&rft.date=2022-02-12&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2628910973%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2628910973&rft_id=info:pmid/&rfr_iscdi=true
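
The abstract in this record describes an ASR model trained jointly with text-to-text auxiliary tasks, sharing a decoder and parts of the encoder, with no extra cost at inference time. The following is a minimal illustrative sketch of that sharing pattern only, not the authors' implementation. The names `Block`, `UnifiedModel`, `path`, and `train`, the split into a speech front-end and a text front-end, and the task labels "mlm" and "mt" are all assumptions made for illustration; "updating" a block is modeled as incrementing a counter rather than taking a gradient step.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Stand-in for a group of trainable parameters."""
    name: str
    updates: int = 0  # number of training steps that touched this block


class UnifiedModel:
    def __init__(self) -> None:
        # Hypothetical split: the abstract only says that the decoder and
        # "parts of the encoder" are shared between speech and text tasks.
        self.speech_frontend = Block("speech_frontend")
        self.text_frontend = Block("text_frontend")
        self.shared_encoder = Block("shared_encoder")
        self.decoder = Block("decoder")

    def path(self, task: str) -> list[Block]:
        """Blocks used by a task: 'asr' takes speech input, any other task takes text."""
        frontend = self.speech_frontend if task == "asr" else self.text_frontend
        return [frontend, self.shared_encoder, self.decoder]


def train(model: UnifiedModel, schedule: list[str]) -> None:
    """Toy joint training: each step touches only the blocks on that task's path."""
    for task in schedule:
        for block in model.path(task):
            block.updates += 1


model = UnifiedModel()
train(model, ["asr", "mlm", "asr", "mt"])
for block in (model.speech_frontend, model.text_frontend,
              model.shared_encoder, model.decoder):
    print(block.name, block.updates)
```

In this toy setup the shared encoder and the decoder are touched by every step, including the text-only ones, while the ASR path never includes the text front-end. That is one way to read the claim that text data can help the recognizer without adding work at inference time.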