Estimating Speech Recognition Accuracy Based on Error Type Classification

Methods for estimating the speech recognition accuracy without using manually transcribed references are beneficial to the research and development of automatic speech recognition technology. This paper proposes recognition accuracy estimation methods based on error type classification (ETC). ETC is an extension of confidence estimation. In ETC, each word in the recognition results (recognized word sequences) for the target speech data is probabilistically classified into three categories: the correct recognition (C), substitution error (S), and insertion error (I). Deletion errors (D) that can occur at interword positions in the recognition results are also probabilistically detected. By summing these CSID probabilities individually, the numbers of CSIDs and, as a result, the two standard recognition accuracy measures, i.e., the percent correct and word accuracy (WAcc), for the speech data can be estimated without using the reference transcriptions. Two recognition accuracy estimation methods based on ETC are proposed. In the first easy-to-use method, ETC is performed by converting the recognition results represented as word confusion networks into word alignment networks (WANs). In the second and more accurate method, the WAN-based ETC results are refined with conditional random fields (CRFs) using various types of additional features extracted for each of the recognized words. Experiments using English and Japanese lecture speech corpora show that the recognition accuracy can be accurately estimated with the CRF-based method. The correlation coefficient and root mean square error between the lecture-level true WAccs calculated using the reference transcriptions and those estimated with the CRF-based method are 0.97 and lower than 2%, respectively. A series of additional experiments and analyses are also conducted to better understand the effectiveness of the CRF-based method.
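The abstract describes the core arithmetic of the proposed estimation: per-word probabilities of correct recognition (C), substitution (S), and insertion (I), plus per-position deletion (D) probabilities, are summed to obtain expected CSID counts, from which the percent correct and word accuracy (WAcc) follow; lecture-level estimates are then judged by the correlation coefficient and root mean square error against true WAccs. The sketch below is a reader's illustration of that arithmetic under assumed inputs, not the authors' implementation; all function names, variable names, and toy values are hypothetical.

```python
# Minimal sketch (assumption, not the paper's code) of estimating recognition
# accuracy from per-word CSID probabilities, as described in the abstract.
from math import sqrt


def estimate_accuracy(word_probs, deletion_probs):
    """Estimate percent correct and WAcc without reference transcriptions.

    word_probs     : list of (p_C, p_S, p_I) per recognized word, summing to 1.
    deletion_probs : list of deletion probabilities, one per interword position.
    """
    n_C = sum(p_c for p_c, _, _ in word_probs)   # expected number of correct words
    n_S = sum(p_s for _, p_s, _ in word_probs)   # expected number of substitutions
    n_I = sum(p_i for _, _, p_i in word_probs)   # expected number of insertions
    n_D = sum(deletion_probs)                    # expected number of deletions

    # The (unobserved) reference length is recovered as C + S + D, following the
    # standard definitions of percent correct and word accuracy.
    n_ref = n_C + n_S + n_D
    percent_correct = 100.0 * n_C / n_ref
    word_accuracy = 100.0 * (n_C - n_I) / n_ref
    return percent_correct, word_accuracy


def correlation_and_rmse(true_waccs, est_waccs):
    """Lecture-level evaluation: Pearson correlation and RMSE between WAccs."""
    n = len(true_waccs)
    mean_t = sum(true_waccs) / n
    mean_e = sum(est_waccs) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_waccs, est_waccs))
    var_t = sum((t - mean_t) ** 2 for t in true_waccs)
    var_e = sum((e - mean_e) ** 2 for e in est_waccs)
    corr = cov / sqrt(var_t * var_e)
    rmse = sqrt(sum((t - e) ** 2 for t, e in zip(true_waccs, est_waccs)) / n)
    return corr, rmse


if __name__ == "__main__":
    # Toy example: three recognized words and two interword deletion slots.
    words = [(0.90, 0.08, 0.02), (0.60, 0.30, 0.10), (0.95, 0.04, 0.01)]
    deletions = [0.05, 0.20]
    print(estimate_accuracy(words, deletions))
```

In this reading, the CSID probabilities would come either directly from the word alignment network (the easy-to-use method) or from the CRF refinement (the more accurate method); the arithmetic that turns them into percent correct and WAcc is the same in both cases.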

Bibliographic Details
Published in: IEEE/ACM transactions on audio, speech, and language processing, 2016-12, Vol. 24 (12), p. 2400-2413
Main authors: Ogawa, Atsunori; Hori, Takaaki; Nakamura, Atsushi
Format: Article
Language: English
Subjects: Automatic speech recognition; Character recognition; conditional random fields; error type classification; Estimation; Feature extraction; Probabilistic logic; recognition accuracy estimation; Speech recognition; Target recognition; word alignment network
DOI: 10.1109/TASLP.2016.2603599
ISSN: 2329-9290
EISSN: 2329-9304