Lattice-based lightly-supervised acoustic model training

In the broadcast domain there is an abundance of related text data and partial transcriptions, such as closed captions and subtitles. This text data can be used for lightly supervised training, in which text matching the audio is selected using an existing speech recognition model. Current approaches to light supervision typically filter the data based on matching error rates between the transcriptions and biased decoding hypotheses. In contrast, semi-supervised training does not require matching text data, instead generating a hypothesis using a background language model. State-of-the-art semi-supervised training uses lattice-based supervision with the lattice-free MMI (LF-MMI) objective function. We propose a technique to combine inaccurate transcriptions with the lattices generated for semi-supervised training, thus preserving uncertainty in the lattice where appropriate. We demonstrate that this combined approach reduces the expected error rates over the lattices, and reduces the word error rate (WER) on a broadcast task.
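The filtering step that the abstract attributes to current lightly supervised approaches can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: it assumes each training segment carries its closed-caption text together with a hypothesis obtained by decoding with a caption-biased language model, and keeps a segment only when the word error rate between the two is low. The `Segment` type, `filter_segments` helper, and the 0.2 threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One audio segment with its closed-caption text and the hypothesis
    produced by decoding the audio with a caption-biased language model."""
    caption: str
    biased_hypothesis: str


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of `hypothesis` measured against `reference`."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)


def filter_segments(segments: list[Segment], max_wer: float = 0.2) -> list[Segment]:
    """Keep only segments whose caption closely matches the biased decode.
    The threshold is illustrative, not a value taken from the paper."""
    return [s for s in segments if wer(s.caption, s.biased_hypothesis) <= max_wer]
```

In this sketch, a segment whose captions drift from the audio (for example, across an edit or an ad break) scores a high WER against the biased decode and is dropped; the paper's contribution is to avoid discarding such segments outright by instead combining the inaccurate transcription with the semi-supervised lattice.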

Authors: Fainberg, Joachim; Klejch, Ondřej; Renals, Steve; Bell, Peter
Format: Article
Language: eng
Subjects: Computer Science - Computation and Language; Computer Science - Sound
Published: 2019-05-30
DOI: 10.48550/arxiv.1905.13150
Online access: https://arxiv.org/abs/1905.13150