Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System
The performance of automatic speech recognition (ASR) systems is usually evaluated with the word error rate (WER) metric when manually transcribed data are available; such transcripts, however, are expensive to obtain in real-world scenarios. In addition, the empirical distribution of WER for most ASR systems tends to place significant mass near zero, making it difficult to model with a single continuous distribution. To address these two issues in ASR quality estimation (QE), we propose a novel neural zero-inflated model that predicts the WER of an ASR result without transcripts. We design a neural zero-inflated beta regression on top of a bidirectional transformer language model conditioned on speech features (speech-BERT). We also adopt a token-level masked language modeling pre-training strategy for speech-BERT, and further fine-tune it with our zero-inflated layer for the mixture of discrete and continuous outputs. Experimental results show that our approach outperforms most existing quality estimation algorithms for ASR or machine translation on WER prediction, as measured by Pearson correlation and MAE.
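As an illustrative sketch (not the authors' implementation), the two quantities at the heart of the abstract can be written down directly: the WER target as a normalized word-level edit distance, and the zero-inflated likelihood as a point mass at zero mixed with a Beta density on (0, 1). All function names here are hypothetical.

```python
import math

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))  # one DP row of the Levenshtein table
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                        # deletion
                       d[j - 1] + 1,                    # insertion
                       prev + (r[i - 1] != h[j - 1]))   # substitution
            prev = cur
    return d[-1] / len(r)

def zero_inflated_beta_nll(y: float, pi: float, a: float, b: float,
                           eps: float = 1e-8) -> float:
    """Negative log-likelihood of a zero-inflated Beta model:
    with probability pi the WER is exactly 0 (a perfect transcript),
    otherwise it is drawn from Beta(a, b) on (0, 1)."""
    if y <= eps:
        return -math.log(pi)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = (a - 1) * math.log(y) + (b - 1) * math.log(1 - y) - log_beta
    return -(math.log(1.0 - pi) + log_pdf)
```

In the paper's setting, `pi`, `a`, and `b` would be predicted per utterance by the zero-inflated output layer on top of the speech-BERT encoder, rather than being fixed constants as in this sketch.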
creator | Fan, Kai; Wang, Jiayi; Li, Bo; Zhang, Shiliang; Chen, Boxing; Ge, Niyu; Yan, Zhijie |
doi | 10.48550/arxiv.1910.01289 |
format | Article |
date | 2019-10-02 |
language | eng |
source | arXiv.org |
subjects | Computer Science - Computation and Language; Computer Science - Sound |