Self-supervised Learning with Random-projection Quantizer for Speech Recognition
We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict the masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular the quantizer projects speech inputs with a randomly initialized matrix, and does a nearest-neighbor lookup in a randomly-initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning.
Saved in:

Main authors: | Chiu, Chung-Cheng; Qin, James; Zhang, Yu; Yu, Jiahui; Wu, Yonghui |
---|---|
Format: | Article |
Language: | English |
Subjects: | Computer Science - Computation and Language; Computer Science - Sound |
Online access: | Order full text |
creator | Chiu, Chung-Cheng; Qin, James; Zhang, Yu; Yu, Jiahui; Wu, Yonghui |
description | We present a simple and effective self-supervised learning approach for speech recognition. The approach learns a model to predict masked speech signals, in the form of discrete labels generated with a random-projection quantizer. In particular, the quantizer projects speech inputs with a randomly initialized matrix and does a nearest-neighbor lookup in a randomly initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separate from the speech recognition model, the design makes the approach flexible and compatible with universal speech recognition architectures. On LibriSpeech our approach achieves word error rates similar to those of previous self-supervised work with non-streaming models, and provides lower word error rates and latency than wav2vec 2.0 and w2v-BERT with streaming models. On multilingual tasks the approach also provides significant improvement over wav2vec 2.0 and w2v-BERT. |
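The description above fully specifies the quantizer: a fixed random projection of each speech frame, followed by a nearest-neighbor lookup in a fixed random codebook, with neither ever trained. A minimal NumPy sketch of that idea (the dimensions, seed, and Euclidean distance metric here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, proj_dim, codebook_size = 16, 4, 8

# Randomly initialized projection matrix and codebook.
# Per the abstract, neither is updated during self-supervised learning.
projection = rng.normal(size=(input_dim, proj_dim))
codebook = rng.normal(size=(codebook_size, proj_dim))

def quantize(frames):
    """Map speech frames of shape (n, input_dim) to discrete labels of shape (n,)."""
    projected = frames @ projection  # random projection to proj_dim
    # Nearest-neighbor lookup: distance from each projected frame to each code.
    dists = np.linalg.norm(projected[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)     # index of nearest codebook entry

labels = quantize(rng.normal(size=(5, input_dim)))
```

These labels would then serve as prediction targets for the masked frames during pre-training; because the quantizer is frozen and separate from the recognizer, any architecture can consume them.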
doi_str_mv | 10.48550/arxiv.2202.01855 |
format | Article |
creationdate | 2022-02-03 |
rights | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2202.01855 |
language | eng |
recordid | cdi_arxiv_primary_2202_01855 |
source | arXiv.org |
subjects | Computer Science - Computation and Language; Computer Science - Sound |
title | Self-supervised Learning with Random-projection Quantizer for Speech Recognition |
url | https://arxiv.org/abs/2202.01855 |