Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System

The performance of automatic speech recognition (ASR) systems is usually evaluated with the word error rate (WER) metric, which requires manually transcribed reference data that are expensive to obtain in real scenarios. In addition, the empirical distribution of WER for most ASR system...
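The WER metric mentioned in the abstract is the word-level edit (Levenshtein) distance between hypothesis and reference, normalized by the reference length. A minimal sketch (the function name is illustrative, not from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance between word sequences,
    normalized by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that insertions can push WER above 1, which is one reason a single continuous distribution on [0, 1] fits the empirical WER distribution poorly.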

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Fan, Kai, Wang, Jiayi, Li, Bo, Zhang, Shiliang, Chen, Boxing, Ge, Niyu, Yan, Zhijie
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Fan, Kai
Wang, Jiayi
Li, Bo
Zhang, Shiliang
Chen, Boxing
Ge, Niyu
Yan, Zhijie
description The performance of automatic speech recognition (ASR) systems is usually evaluated with the word error rate (WER) metric, which requires manually transcribed reference data that are expensive to obtain in real scenarios. In addition, the empirical distribution of WER for most ASR systems tends to put a significant mass near zero, making it difficult to model with a single continuous distribution. To address these two issues of ASR quality estimation (QE), we propose a novel neural zero-inflated model that predicts the WER of an ASR result without transcripts. We design a neural zero-inflated beta regression on top of a bidirectional transformer language model conditioned on speech features (speech-BERT). We also adopt a token-level masked language modeling pre-training strategy for speech-BERT, and further fine-tune with our zero-inflated layer for the mixture of discrete and continuous outputs. Experimental results show that our approach achieves better WER prediction, measured by Pearson correlation and MAE, than most existing quality estimation algorithms for ASR or machine translation.
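The "mixture of discrete and continuous outputs" in the abstract refers to a zero-inflated beta distribution: a point mass at WER = 0 mixed with a Beta density on (0, 1). A minimal sketch of its log-density, under an illustrative parameterization (mixing weight `pi`, shape parameters `alpha`, `beta`; not the paper's exact formulation):

```python
import math

def zero_inflated_beta_logpdf(y: float, pi: float,
                              alpha: float, beta: float) -> float:
    """Log-density of a zero-inflated beta mixture: with probability pi
    the outcome is exactly 0 (point mass), otherwise it follows
    Beta(alpha, beta) on the open interval (0, 1)."""
    if y == 0.0:
        return math.log(pi)
    # log normalizing constant of the beta density: log B(alpha, beta)
    log_beta_norm = (math.lgamma(alpha) + math.lgamma(beta)
                     - math.lgamma(alpha + beta))
    return (math.log(1.0 - pi)
            + (alpha - 1.0) * math.log(y)
            + (beta - 1.0) * math.log(1.0 - y)
            - log_beta_norm)
```

In a model along these lines, a network head would emit `pi`, `alpha`, and `beta` per utterance and be trained by minimizing the negative of this log-likelihood over observed WER values.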
doi_str_mv 10.48550/arxiv.1910.01289
format Article
identifier DOI: 10.48550/arxiv.1910.01289
language eng
recordid cdi_arxiv_primary_1910_01289
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Sound
title Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System