USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder
Improving end-to-end speech recognition by incorporating external text data has been a longstanding research topic. There has been a recent focus on training E2E ASR models that get the performance benefits of external text data without incurring the extra cost of evaluating an external language mod...
Published in: | arXiv.org 2022-02 |
---|---|
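Main authors: | Bolaji Yusuf ; Gandhe, Ankur ; Sokolov, Alex |
Format: | Article |
Language: | eng |
Subjects: | Coders ; Encoders-Decoders ; Inference ; Machine translation ; Speech recognition ; Training |
Online access: | Full text |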
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Bolaji Yusuf ; Gandhe, Ankur ; Sokolov, Alex |
description | Improving end-to-end (E2E) speech recognition by incorporating external text data has been a longstanding research topic. Recent work has focused on training E2E ASR models that gain the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training an ASR model jointly with a set of text-to-text auxiliary tasks with which it shares a decoder and parts of the encoder. When we jointly train ASR and a masked language model on the 960-hour Librispeech and Opensubtitles data respectively, we observe WER reductions of 16% and 20% on test-other and test-clean respectively over an ASR-only baseline, without any extra cost at inference time, and reductions of 6% and 8% compared to a stronger MUTE-L baseline, which trains the decoder with the same text data as our model. We achieve further improvements when we train the masked language model on Librispeech data or when we use machine translation as the auxiliary task, without significantly sacrificing performance on the task itself. |
format | Article |
publisher | Ithaca: Cornell University Library, arXiv.org |
creationdate | 2022-02-12 |
rights | 2022. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
oa | free_for_read |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2022-02 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2628910973 |
source | Free E- Journals |
subjects | Coders ; Encoders-Decoders ; Inference ; Machine translation ; Speech recognition ; Training |
title | USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-21T05%3A50%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=USTED:%20Improving%20ASR%20with%20a%20Unified%20Speech%20and%20Text%20Encoder-Decoder&rft.jtitle=arXiv.org&rft.au=Bolaji%20Yusuf&rft.date=2022-02-12&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2628910973%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2628910973&rft_id=info:pmid/&rfr_iscdi=true |
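
The abstract in this record says that the ASR model shares a decoder and parts of the encoder with the text-to-text auxiliary tasks, and that the benefit comes at no extra cost at inference time. The following is a minimal toy sketch of that sharing pattern, not the authors' implementation. All module names (`speech_frontend`, `text_frontend`, `shared_encoder`, `shared_decoder`) are hypothetical, and the split between task-specific and shared encoder layers is an assumption, since the record does not say which layers are shared.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Module:
    """Stand-in for a neural block; it only records that it was visited."""
    name: str

    def __call__(self, path: List[str]) -> List[str]:
        return path + [self.name]


@dataclass
class UnifiedModel:
    # Task-specific encoder parts (hypothetical names).
    speech_frontend: Module = field(default_factory=lambda: Module("speech_frontend"))
    text_frontend: Module = field(default_factory=lambda: Module("text_frontend"))
    # Parts shared by ASR and the text-to-text auxiliary task (hypothetical names).
    shared_encoder: Module = field(default_factory=lambda: Module("shared_encoder"))
    shared_decoder: Module = field(default_factory=lambda: Module("shared_decoder"))

    def forward(self, task: str) -> List[str]:
        """Return the list of modules an input for `task` passes through."""
        if task == "asr":
            frontend = self.speech_frontend
        elif task in ("mlm", "mt"):  # masked LM or machine translation
            frontend = self.text_frontend
        else:
            raise ValueError(f"unknown task: {task}")
        return self.shared_decoder(self.shared_encoder(frontend([])))


model = UnifiedModel()
asr_path = model.forward("asr")
mlm_path = model.forward("mlm")
print(asr_path)
print(mlm_path)
```

In this toy version the ASR path never touches the text-only front end, which illustrates why text data can shape the shared encoder layers and decoder during training while recognition still runs a single model with no external language model.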