Scaling Laws for Discriminative Speech Recognition Rescoring Models

Recent studies have found that model performance has a smooth power-law relationship, or scaling laws, with training data and model size, for a wide range of problems. These scaling laws allow one to choose nearly optimal data and model sizes. We study whether this scaling property is also applicabl...
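As a rough illustration of what a power-law relationship between word error rate and training data size means in practice, the sketch below fits a curve of the form y = a * x^(-alpha) by linear regression in log-log space. The functional form, the helper name `fit_power_law`, and all numbers are assumptions chosen for illustration; they are synthetic and are not taken from the paper, whose exact parameterization is not given in this record.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = a * x**(-alpha) by least squares in log-log space.

    Hypothetical helper for illustration; returns (a, alpha).
    """
    # log(y) = log(a) - alpha * log(x), so a straight-line fit suffices.
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic, noise-free points generated from a known law.
# These are NOT results from the paper.
data_sizes = np.array([1e3, 1e4, 1e5, 1e6])
true_a, true_alpha = 50.0, 0.1
wer = true_a * data_sizes ** (-true_alpha)

a_hat, alpha_hat = fit_power_law(data_sizes, wer)
print(f"a = {a_hat:.3f}, alpha = {alpha_hat:.3f}")
```

With real measurements the points would scatter around the fitted line, and the fitted exponent could be used to estimate how much additional data a target error rate would require.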

Bibliographic details
Main authors:	Gu, Yile; Shivakumar, Prashanth Gurunath; Kolehmainen, Jari; Gandhe, Ankur; Rastrow, Ariya; Bulyko, Ivan
Format:	Article
Language:	eng
Online access:	Order full text

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Gu, Yile ; Shivakumar, Prashanth Gurunath ; Kolehmainen, Jari ; Gandhe, Ankur ; Rastrow, Ariya ; Bulyko, Ivan
description	Recent studies have found that model performance has a smooth power-law relationship, or scaling laws, with training data and model size, for a wide range of problems. These scaling laws allow one to choose nearly optimal data and model sizes. We study whether this scaling property is also applicable to second-pass rescoring, which is an important component of speech recognition systems. We focus on RescoreBERT as the rescoring model, which uses a pre-trained Transformer-based architecture fine-tuned with an ASR discriminative loss. Using such a rescoring model, we show that the word error rate (WER) follows a scaling law for over two orders of magnitude as training data and model size increase. In addition, it is found that a pre-trained model would require less data than a randomly initialized model of the same size, representing effective data transferred from the pre-training step. This effective data transferred is also found to follow a scaling law with the data and model size.
doi_str_mv	10.48550/arxiv.2306.15815
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2306_15815</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2306_15815</sourcerecordid><originalsourceid>FETCH-LOGICAL-a675-7d8bcdf9419e3a60cf64a5c618d8672479e30092a523c99e30c960c44feb19683</originalsourceid><addsrcrecordid>eNotj81OwzAQhH3pARUegBN-gQQ7_ol9ROFXSlWp7T3abuxiKY0ruyrw9iSF0-6OZkfzEXLPWSmNUuwR0ne4lJVguuTKcHVDmi3CEMYDbeErUx8TfQ4ZUziGEc7h4uj25Bx-0o3DeBjDOcRx2jPGND-tYu-GfEsWHobs7v7nkuxeX3bNe9Gu3z6ap7YAXaui7s0ee28lt06AZui1BIWam97oupL1JDNmK1CVQDsfaCeXlN7tudVGLMnDX-yVojtNJSH9dDNNd6URv_5ERHk</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Scaling Laws for Discriminative Speech Recognition Rescoring Models</title><source>arXiv.org</source><creator>Gu, Yile ; Shivakumar, Prashanth Gurunath ; Kolehmainen, Jari ; Gandhe, Ankur ; Rastrow, Ariya ; Bulyko, Ivan</creator><creatorcontrib>Gu, Yile ; Shivakumar, Prashanth Gurunath ; Kolehmainen, Jari ; Gandhe, Ankur ; Rastrow, Ariya ; Bulyko, Ivan</creatorcontrib><description>Recent studies have found that model performance has a smooth power-law relationship, or scaling laws, with training data and model size, for a wide range of problems. These scaling laws allow one to choose nearly optimal data and model sizes. We study whether this scaling property is also applicable to second-pass rescoring, which is an important component of speech recognition systems. We focus on RescoreBERT as the rescoring model, which uses a pre-trained Transformer-based architecture fined tuned with an ASR discriminative loss. Using such a rescoring model, we show that the word error rate (WER) follows a scaling law for over two orders of magnitude as training data and model size increase. 
In addition, it is found that a pre-trained model would require less data than a randomly initialized model of the same size, representing effective data transferred from pre-training step. This effective data transferred is found to also follow a scaling law with the data and model size.</description><identifier>DOI: 10.48550/arxiv.2306.15815</identifier><language>eng</language><creationdate>2023-06</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2306.15815$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2306.15815$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Gu, Yile</creatorcontrib><creatorcontrib>Shivakumar, Prashanth Gurunath</creatorcontrib><creatorcontrib>Kolehmainen, Jari</creatorcontrib><creatorcontrib>Gandhe, Ankur</creatorcontrib><creatorcontrib>Rastrow, Ariya</creatorcontrib><creatorcontrib>Bulyko, Ivan</creatorcontrib><title>Scaling Laws for Discriminative Speech Recognition Rescoring Models</title><description>Recent studies have found that model performance has a smooth power-law relationship, or scaling laws, with training data and model size, for a wide range of problems. These scaling laws allow one to choose nearly optimal data and model sizes. We study whether this scaling property is also applicable to second-pass rescoring, which is an important component of speech recognition systems. We focus on RescoreBERT as the rescoring model, which uses a pre-trained Transformer-based architecture fined tuned with an ASR discriminative loss. 
Using such a rescoring model, we show that the word error rate (WER) follows a scaling law for over two orders of magnitude as training data and model size increase. In addition, it is found that a pre-trained model would require less data than a randomly initialized model of the same size, representing effective data transferred from pre-training step. This effective data transferred is found to also follow a scaling law with the data and model size.</description><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj81OwzAQhH3pARUegBN-gQQ7_ol9ROFXSlWp7T3abuxiKY0ruyrw9iSF0-6OZkfzEXLPWSmNUuwR0ne4lJVguuTKcHVDmi3CEMYDbeErUx8TfQ4ZUziGEc7h4uj25Bx-0o3DeBjDOcRx2jPGND-tYu-GfEsWHobs7v7nkuxeX3bNe9Gu3z6ap7YAXaui7s0ee28lt06AZui1BIWam97oupL1JDNmK1CVQDsfaCeXlN7tudVGLMnDX-yVojtNJSH9dDNNd6URv_5ERHk</recordid><startdate>20230627</startdate><enddate>20230627</enddate><creator>Gu, Yile</creator><creator>Shivakumar, Prashanth Gurunath</creator><creator>Kolehmainen, Jari</creator><creator>Gandhe, Ankur</creator><creator>Rastrow, Ariya</creator><creator>Bulyko, Ivan</creator><scope>GOX</scope></search><sort><creationdate>20230627</creationdate><title>Scaling Laws for Discriminative Speech Recognition Rescoring Models</title><author>Gu, Yile ; Shivakumar, Prashanth Gurunath ; Kolehmainen, Jari ; Gandhe, Ankur ; Rastrow, Ariya ; Bulyko, Ivan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a675-7d8bcdf9419e3a60cf64a5c618d8672479e30092a523c99e30c960c44feb19683</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><toplevel>online_resources</toplevel><creatorcontrib>Gu, Yile</creatorcontrib><creatorcontrib>Shivakumar, Prashanth Gurunath</creatorcontrib><creatorcontrib>Kolehmainen, Jari</creatorcontrib><creatorcontrib>Gandhe, Ankur</creatorcontrib><creatorcontrib>Rastrow, 
Ariya</creatorcontrib><creatorcontrib>Bulyko, Ivan</creatorcontrib><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Gu, Yile</au><au>Shivakumar, Prashanth Gurunath</au><au>Kolehmainen, Jari</au><au>Gandhe, Ankur</au><au>Rastrow, Ariya</au><au>Bulyko, Ivan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Scaling Laws for Discriminative Speech Recognition Rescoring Models</atitle><date>2023-06-27</date><risdate>2023</risdate><abstract>Recent studies have found that model performance has a smooth power-law relationship, or scaling laws, with training data and model size, for a wide range of problems. These scaling laws allow one to choose nearly optimal data and model sizes. We study whether this scaling property is also applicable to second-pass rescoring, which is an important component of speech recognition systems. We focus on RescoreBERT as the rescoring model, which uses a pre-trained Transformer-based architecture fined tuned with an ASR discriminative loss. Using such a rescoring model, we show that the word error rate (WER) follows a scaling law for over two orders of magnitude as training data and model size increase. In addition, it is found that a pre-trained model would require less data than a randomly initialized model of the same size, representing effective data transferred from pre-training step. This effective data transferred is found to also follow a scaling law with the data and model size.</abstract><doi>10.48550/arxiv.2306.15815</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2306.15815
ispartof
issn
language	eng
recordid	cdi_arxiv_primary_2306_15815
source	arXiv.org
title	Scaling Laws for Discriminative Speech Recognition Rescoring Models
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T02%3A30%3A27IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Scaling%20Laws%20for%20Discriminative%20Speech%20Recognition%20Rescoring%20Models&rft.au=Gu,%20Yile&rft.date=2023-06-27&rft_id=info:doi/10.48550/arxiv.2306.15815&rft_dat=%3Carxiv_GOX%3E2306_15815%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true