Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition

When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition. However, pseudo-labels are often noisy, containing numerous incorrect tokens, and taking noisy labels as ground truth in the loss function results in suboptimal performance. Previous works attempted to mitigate this issue by either filtering out the noisiest pseudo-labels or improving the overall quality of pseudo-labels. While these methods are effective to some extent, it is unrealistic to eliminate incorrect tokens from pseudo-labels entirely. In this work, we propose a novel framework named alternative pseudo-labeling that tackles noisy pseudo-labels from the perspective of the training objective. The framework comprises several components. First, a generalized CTC loss function handles noisy pseudo-labels by accepting alternative tokens at the positions of incorrect tokens. Applying this loss function in pseudo-labeling requires detecting the incorrect tokens in the predicted pseudo-labels; we adopt a confidence-based error detection method that identifies incorrect tokens by comparing their confidence scores against a given threshold, which in turn requires the confidence scores to be discriminative. Hence, the second proposed technique is a contrastive CTC loss function that widens the confidence gap between correctly and incorrectly predicted tokens, thereby improving error detection. Finally, because obtaining satisfactory performance with confidence-based error detection typically requires extensive threshold tuning, we propose an automatic thresholding method that uses labeled data as a proxy for determining the threshold, eliminating the need for manual tuning. Experiments demonstrate that alternative pseudo-labeling outperforms existing pseudo-labeling approaches on datasets in various domains and languages.
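The two detection-related components lend themselves to a short illustration. The sketch below assumes per-token confidence scores are available from decoding: `detect_errors` implements the stated rule (flag tokens whose confidence falls below a threshold), while `choose_threshold` sketches the automatic thresholding idea by sweeping candidate thresholds on labeled data, where true error positions are known from alignment with the reference. The function names, the candidate grid, and the use of error-detection F1 as the selection criterion are assumptions; the abstract only states that labeled data serves as a proxy for choosing the threshold.

```python
import numpy as np

def detect_errors(confidences, threshold):
    """Flag pseudo-label tokens as incorrect when their confidence
    falls below the threshold (the confidence-based detection rule)."""
    return np.asarray(confidences, dtype=float) < threshold

def choose_threshold(confidences, is_error, candidates=None):
    """Sweep candidate thresholds on labeled data and keep the one with
    the best error-detection F1. `is_error` marks tokens where the
    decoded hypothesis disagrees with the reference transcript."""
    confidences = np.asarray(confidences, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 101)
    best_thr, best_f1 = float(candidates[0]), -1.0
    for thr in candidates:
        flagged = confidences < thr
        tp = float(np.sum(flagged & is_error))
        fp = float(np.sum(flagged & ~is_error))
        fn = float(np.sum(~flagged & is_error))
        prec = tp / (tp + fp) if tp + fp > 0 else 0.0
        rec = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = float(thr), f1
    return best_thr
```

In use, the threshold tuned on the labeled set would then be applied to the pseudo-labels, e.g. `mask = detect_errors(pseudo_label_confidences, choose_threshold(dev_confidences, dev_errors))`.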

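The contrastive objective that makes confidences discriminative can likewise only be sketched from the abstract. The paper defines a contrastive CTC loss over alignments; the pairwise hinge below is a simplified stand-in that captures the stated goal, widening the confidence gap between correctly and incorrectly predicted tokens, and the `margin` hyperparameter is hypothetical.

```python
import torch

def confidence_gap_penalty(confidences, is_correct, margin=0.5):
    """Simplified margin penalty: encourage every correct token's
    confidence to exceed every incorrect token's confidence by at
    least `margin`. Not the paper's contrastive CTC loss, which is
    defined over CTC alignments, but the same separation pressure."""
    pos = confidences[is_correct]    # confidences of correct tokens
    neg = confidences[~is_correct]   # confidences of incorrect tokens
    if pos.numel() == 0 or neg.numel() == 0:
        return confidences.new_zeros(())
    gap = pos.unsqueeze(1) - neg.unsqueeze(0)  # all correct/incorrect pairs
    return torch.clamp(margin - gap, min=0.0).mean()
```

Added to the training loss on labeled data, where token correctness is known, a penalty of this shape pushes the two confidence distributions apart, which is what makes a single detection threshold workable.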
Bibliographic Details

Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, Vol. 31, pp. 1-11
Main Authors: Zhu, Han; Gao, Dongji; Cheng, Gaofeng; Povey, Daniel; Zhang, Pengyuan; Yan, Yonghong
Format: Article
Language: English
Publisher: IEEE, Piscataway
DOI: 10.1109/TASLP.2023.3306709
ISSN: 2329-9290
EISSN: 2329-9304
Subjects: Automatic speech recognition; Computational modeling; Data models; Error correction; Error correction & detection; Error detection; Estimation; Filtering; Labeling; Labelling; Labels; Noise; Noise measurement; Performance enhancement; pseudo-labeling; Semi-supervised learning; Speech recognition; Thresholds; Training; Tuning; Voice recognition