Self-supervised Learning with Random-projection Quantizer for Speech Recognition

We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular, the quantizer projects speech inputs with a randomly initialized matrix and does a nearest-neighbor lookup in a randomly initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separate from the speech recognition model, the design makes the approach flexible and compatible with universal speech recognition architectures. On LibriSpeech our approach achieves word error rates similar to those of previous self-supervised learning work with non-streaming models, and provides lower word error rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT.
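As a concrete illustration of the quantizer described above, here is a minimal NumPy sketch. The feature dimension, projection dimension, codebook size, and Gaussian initialization are illustrative assumptions, not values from the paper; only the overall scheme (a fixed random projection followed by a nearest-neighbor lookup in a fixed random codebook) comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumptions, not the paper's settings).
INPUT_DIM = 80       # e.g. log-mel features per frame
PROJ_DIM = 16        # random-projection output dimension
CODEBOOK_SIZE = 1024 # number of discrete labels

# Randomly initialized projection matrix and codebook. Per the abstract,
# neither is updated during self-supervised learning.
projection = rng.normal(size=(INPUT_DIM, PROJ_DIM))
codebook = rng.normal(size=(CODEBOOK_SIZE, PROJ_DIM))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map speech frames of shape (T, INPUT_DIM) to discrete labels (T,)."""
    projected = frames @ projection  # (T, PROJ_DIM)
    # Nearest-neighbor lookup: Euclidean distance to every codebook entry.
    dists = np.linalg.norm(projected[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)      # index of nearest entry = pretraining label

# Example: 100 random "frames" become 100 discrete pretraining targets.
labels = quantize(rng.normal(size=(100, INPUT_DIM)))
print(labels.shape)  # (100,)
```

Because the projection and codebook are frozen, target generation is a one-time preprocessing step that is fully decoupled from whatever recognition architecture is trained on top.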

Detailed description

Saved in:
Bibliographic details
Main authors: Chiu, Chung-Cheng, Qin, James, Zhang, Yu, Yu, Jiahui, Wu, Yonghui
Format: Article
Language: English
Subjects:
Online access: Order full text
creator Chiu, Chung-Cheng
Qin, James
Zhang, Yu
Yu, Jiahui
Wu, Yonghui
description We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular, the quantizer projects speech inputs with a randomly initialized matrix and does a nearest-neighbor lookup in a randomly initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separate from the speech recognition model, the design makes the approach flexible and compatible with universal speech recognition architectures. On LibriSpeech our approach achieves word error rates similar to those of previous self-supervised learning work with non-streaming models, and provides lower word error rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT.
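To connect the quantizer to the training objective named in the description, the sketch below pairs masked inputs with quantizer labels. The masking probability, span length, noise fill, and the helper name masked_pretraining_targets are hypothetical choices for illustration; the abstract specifies only that the model predicts the quantizer's discrete labels for masked speech. It assumes the quantize function from the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def masked_pretraining_targets(frames, quantize, mask_prob=0.06, span=4):
    """Pair masked inputs with quantizer labels for masked prediction.

    `quantize` is the fixed random-projection quantizer sketched above;
    mask_prob and span are hypothetical hyperparameters.
    """
    labels = quantize(frames)   # discrete target for every frame
    T, D = frames.shape
    mask = np.zeros(T, dtype=bool)
    # Sample span starts, then mask contiguous spans of frames.
    for start in np.flatnonzero(rng.random(T) < mask_prob):
        mask[start:start + span] = True
    masked = frames.copy()
    # Replace masked frames with noise; the model is trained to predict
    # `labels` at the positions where `mask` is True.
    masked[mask] = rng.normal(size=(mask.sum(), D))
    return masked, labels, mask
```

A pretraining loop would feed `masked` through the speech encoder and apply a cross-entropy loss against `labels` at the masked positions only, leaving the quantizer itself untouched.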
doi_str_mv 10.48550/arxiv.2202.01855
format Article
creationdate 2022-02-03
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2202.01855
language eng
recordid cdi_arxiv_primary_2202_01855
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Sound
title Self-supervised Learning with Random-projection Quantizer for Speech Recognition
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-14T02%3A20%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Self-supervised%20Learning%20with%20Random-projection%20Quantizer%20for%20Speech%20Recognition&rft.au=Chiu,%20Chung-Cheng&rft.date=2022-02-03&rft_id=info:doi/10.48550/arxiv.2202.01855&rft_dat=%3Carxiv_GOX%3E2202_01855%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true