Mask The Bias: Improving Domain-Adaptive Generalization of CTC-based ASR with Internal Language Model Estimation

End-to-end ASR models trained on large amounts of data tend to be implicitly biased towards the language semantics of the training data. Internal language model estimation (ILME) has been proposed to mitigate this bias for autoregressive models such as attention-based encoder-decoder and RNN-T. Typically, ILME is performed by modularizing the acoustic and language components of the model architecture and eliminating the acoustic input to perform log-linear interpolation with the text-only posterior. However, for CTC-based ASR it is not as straightforward to decouple the model into such acoustic and language components, as CTC log-posteriors are computed in a non-autoregressive manner. In this work, we propose a novel ILME technique for CTC-based ASR models. Our method iteratively masks the audio timesteps to estimate a pseudo log-likelihood of the internal LM by accumulating log-posteriors for only the masked timesteps. Extensive evaluation across multiple out-of-domain datasets reveals that the proposed approach improves WER by up to 9.8% and OOV F1-score by up to 24.6% relative to Shallow Fusion when only text data from the target domain is available. In the case of zero-shot domain adaptation, with no access to any target-domain data, we demonstrate that removing the source-domain bias with ILME can still outperform Shallow Fusion, improving WER by up to 9.3% relative.
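
For context, the "log-linear interpolation with the text-only posterior" mentioned in the abstract is the standard ILME decoding rule from the autoregressive literature. A generic form is sketched below; the interpolation weights \lambda are placeholder symbols, not values reported by this paper:

    Shallow Fusion:  \hat{y} = \arg\max_y \, [ \log P(y \mid x) + \lambda_{LM} \log P_{LM}(y) ]
    ILME:            \hat{y} = \arg\max_y \, [ \log P(y \mid x) - \lambda_{ILM} \log P_{ILM}(y) + \lambda_{LM} \log P_{LM}(y) ]

Here P(y|x) is the end-to-end model's posterior, P_LM is the external target-domain LM, and P_ILM is the estimated internal LM whose score is subtracted to remove the source-domain bias.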

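The masking procedure described in the abstract can be made concrete with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the model interface (features in, per-frame CTC log-posteriors out), the fixed block-masking schedule, the zero mask value, and the vocab_size attribute are all hypothetical.

    import torch

    def ctc_ilm_pseudo_log_likelihood(model, audio_feats, mask_width=8):
        """Estimate per-frame internal-LM log-posteriors for a CTC model by
        iteratively masking blocks of audio timesteps and keeping only the
        outputs at the masked positions (per the abstract's description)."""
        T = audio_feats.size(0)
        ilm_log_probs = torch.zeros(T, model.vocab_size)  # vocab_size: assumed attribute
        for start in range(0, T, mask_width):
            end = min(start + mask_width, T)
            masked = audio_feats.clone()
            masked[start:end] = 0.0  # remove the acoustic evidence in this block
            with torch.no_grad():
                log_probs = model(masked)  # assumed to return (T, vocab) log-posteriors
            # With no acoustic input at these frames, the model's outputs there
            # reflect its internal (contextual) language prior; accumulate them.
            ilm_log_probs[start:end] = log_probs[start:end]
        return ilm_log_probs

At decode time, a hypothesis's pseudo log-likelihood under the internal LM would be read off these accumulated posteriors and subtracted, log-linearly, alongside an external LM score as in the formula above.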

Bibliographic details
Main authors: Das, Nilaksh; Sunkara, Monica; Bodapati, Sravan; Cai, Jinglun; Kulshreshtha, Devang; Farris, Jeff; Kirchhoff, Katrin
Format: Article
Language: English
Subjects: Computer Science - Learning; Computer Science - Sound
Online access: https://arxiv.org/abs/2305.03837
DOI: 10.48550/arxiv.2305.03837
Published: 2023-05-05
Rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0 (free to read)