The THUEE System Description for the IARPA OpenASR21 Challenge

This paper describes the THUEE team's speech recognition system for the IARPA Open Automatic Speech Recognition Challenge (OpenASR21), together with further experimental explorations. We achieve outstanding results under both the Constrained and Constrained-plus training conditions. For the Constrained training condition, we construct our basic ASR system on the standard hybrid architecture. To alleviate the Out-Of-Vocabulary (OOV) problem, we extend the pronunciation lexicon using Grapheme-to-Phoneme (G2P) techniques for both OOV and potential new words. Standard acoustic model structures such as CNN-TDNN-F and CNN-TDNN-F-A are adopted, and multiple data augmentation techniques are applied. For the Constrained-plus training condition, we use the self-supervised learning framework wav2vec2.0 and experiment with various fine-tuning techniques under the Connectionist Temporal Classification (CTC) criterion on top of the publicly available pre-trained model XLSR-53. We find that the frontend feature extractor plays an important role when applying the wav2vec2.0 pre-trained model to the encoder-decoder based CTC/Attention ASR architecture, and that extra improvements can be achieved by using a CTC model fine-tuned in the target language as the frontend feature extractor.
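
For the Constrained-condition lexicon extension described above, the sketch below shows the general shape of a G2P-based OOV recovery step. It is illustrative only: the paper does not release its G2P code, and the `g2p_en` package used here is an English stand-in for the language-specific G2P model (e.g. one trained on the provided lexicon) that a real OpenASR21 system would use; all file paths are hypothetical.

```python
# Hypothetical sketch: add G2P-generated pronunciations for words that are
# missing from the pronunciation lexicon. The g2p_en package and the paths
# are illustrative assumptions, not the authors' actual pipeline.
from g2p_en import G2p  # pip install g2p_en

def extend_lexicon(lexicon_path: str, corpus_words: set, out_path: str) -> None:
    """Write G2P pronunciations for every corpus word absent from the lexicon."""
    g2p = G2p()
    with open(lexicon_path, encoding="utf-8") as f:
        known = {line.split()[0] for line in f if line.strip()}
    oov = sorted(corpus_words - known)
    with open(out_path, "w", encoding="utf-8") as f:
        for word in oov:
            phones = [p for p in g2p(word) if p.strip()]  # drop space tokens
            f.write(f"{word} {' '.join(phones)}\n")

extend_lexicon("lexicon.txt", {"openasr", "wav2vec"}, "lexicon_oov.txt")
```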

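For the Constrained-plus condition, the following is a minimal sketch of CTC fine-tuning on top of the public XLSR-53 checkpoint, using the HuggingFace port of wav2vec2.0 as an assumed implementation (the paper does not specify its toolkit, and none of its hyperparameters are reproduced here). The last line illustrates the paper's finding that the fine-tuned model can then serve as a frontend feature extractor for a CTC/Attention encoder-decoder.

```python
# Hypothetical sketch: fine-tune XLSR-53 with a character-level CTC head.
# "vocab.json" (a char-to-id map built from the target-language transcripts)
# and the toy batch are assumptions for illustration.
import torch
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",      # public multilingual checkpoint
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),              # size the CTC head to the vocab
)
model.freeze_feature_encoder()              # keep the CNN frontend frozen

# One toy optimization step on 1 s of fake 16 kHz audio.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
inputs = torch.randn(1, 16000)
labels = torch.tensor([tokenizer("hola").input_ids])
loss = model(input_values=inputs, labels=labels).loss
loss.backward()
optimizer.step()

# After fine-tuning, the model's hidden states can be reused as frontend
# features for a downstream CTC/Attention encoder-decoder:
features = model.wav2vec2(inputs).last_hidden_state  # (batch, frames, dim)
```
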
Bibliographic Details
Main Authors: Zhao, Jing; Wang, Haoyu; Li, Jinpeng; Chai, Shuzhou; Wang, Guan-Bo; Chen, Guoguo; Zhang, Wei-Qiang
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Sound
DOI: 10.48550/arxiv.2206.14660
Source: arXiv.org
Published: 2022-06-29
Online Access: https://arxiv.org/abs/2206.14660