Lattice-based lightly-supervised acoustic model training

In the broadcast domain there is an abundance of related text data and partial transcriptions, such as closed captions and subtitles. This text data can be used for lightly supervised training, in which text matching the audio is selected using an existing speech recognition model. Current approaches to light supervision typically filter the data based on matching error rates between the transcriptions and biased decoding hypotheses. In contrast, semi-supervised training does not require matching text data, instead generating a hypothesis using a background language model. State-of-the-art semi-supervised training uses lattice-based supervision with the lattice-free MMI (LF-MMI) objective function. We propose a technique to combine inaccurate transcriptions with the lattices generated for semi-supervised training, thus preserving uncertainty in the lattice where appropriate. We demonstrate that this combined approach reduces the expected error rates over the lattices, and reduces the word error rate (WER) on a broadcast task.
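The filtering step that the abstract attributes to current lightly supervised approaches can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: it assumes each training segment carries its closed-caption text together with a hypothesis obtained by decoding with a caption-biased language model, and keeps a segment only when the word error rate between the two is low. The `Segment` type, `filter_segments` helper, and the 0.2 threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One audio segment with its closed-caption text and the hypothesis
    produced by decoding the audio with a caption-biased language model."""
    caption: str
    biased_hypothesis: str


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of `hypothesis` measured against `reference`."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)


def filter_segments(segments: list[Segment], max_wer: float = 0.2) -> list[Segment]:
    """Keep only segments whose caption closely matches the biased decode.
    The threshold is illustrative, not a value taken from the paper."""
    return [s for s in segments if wer(s.caption, s.biased_hypothesis) <= max_wer]
```

In this sketch, a segment whose captions drift from the audio (for example, across an edit or an ad break) scores a high WER against the biased decode and is dropped; the paper's contribution is to avoid discarding such segments outright by instead combining the inaccurate transcription with the semi-supervised lattice.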

Authors: Fainberg, Joachim; Klejch, Ondřej; Renals, Steve; Bell, Peter
Format: Article
Language: eng
Subjects: Computer Science - Computation and Language; Computer Science - Sound
Published: 2019-05-30
DOI: 10.48550/arxiv.1905.13150
Online access: https://arxiv.org/abs/1905.13150