Scaling Laws in Linear Regression: Compute, Parameters, and Data

Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite dimensional linear regression setup. Specifically, we consider a model with $M$ parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using $N$ data. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree $a>1$, we show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$. The variance error, which increases with $M$, is dominated by the other errors due to the implicit regularization of SGD, thus disappearing from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
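The setup described in the abstract invites a quick numerical check. The sketch below is an illustration under assumptions, not the authors' code: it draws covariates whose covariance has a power-law spectrum of degree $a$, compresses them through a random $M \times d$ Gaussian sketch, fits the $M$ sketched parameters with one-pass SGD over $N$ samples, and reports the resulting excess risk. The truncation $d$, the step size, the noise level, and the Gaussian sketch are illustrative choices. Per the stated result, the risk should shrink roughly like $M^{-(a-1)} + N^{-(a-1)/a}$ (about $1/M + 1/\sqrt{N}$ for $a = 2$) as $M$ and $N$ grow.

```python
# Minimal simulation sketch (illustrative, not the authors' code) of the setup
# in the abstract: linear regression on sketched covariates with a power-law
# covariance spectrum, trained by one-pass SGD. The truncation d, step size,
# noise level, and Gaussian sketch are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)

d = 2000                                         # truncation of the ambient (infinite-dim.) space
a = 2.0                                          # power-law degree of the spectrum (a > 1)
eigs = np.arange(1, d + 1, dtype=float) ** -a    # lambda_i ~ i^{-a}
w_star = rng.normal(size=d)                      # optimal parameter, Gaussian prior
noise_std = 0.1                                  # label noise

def excess_risk(M, N, lr=0.1):
    """One-pass SGD on M sketched parameters with N samples; return excess risk."""
    S = rng.normal(size=(M, d)) / np.sqrt(M)     # random sketching matrix
    v = np.zeros(M)                              # sketched model parameters
    for _ in range(N):                           # one pass: each sample used exactly once
        x = np.sqrt(eigs) * rng.normal(size=d)   # covariate with power-law covariance
        y = x @ w_star + noise_std * rng.normal()
        z = S @ x                                 # sketched covariate seen by the model
        v += lr * (y - z @ v) * z                 # SGD step on the squared loss
    diff = S.T @ v - w_star                       # effective parameter error
    return float(eigs @ diff**2)                  # E[((S^T v - w*)^T x)^2] under the spectrum

# The reducible error should decay roughly like M^{-(a-1)} + N^{-(a-1)/a},
# i.e. about 1/M + 1/sqrt(N) for a = 2, as M and N grow.
for M in (50, 100, 200):
    for N in (1_000, 4_000, 16_000):
        print(f"M={M:4d}  N={N:6d}  excess risk ~ {excess_risk(M, N):.4f}")
```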

Detailed Description

Bibliographic Details
Main Authors: Lin, Licong; Wu, Jingfeng; Kakade, Sham M; Bartlett, Peter L; Lee, Jason D
Format: Article
Language: eng
Subjects: Computer Science - Artificial Intelligence; Computer Science - Learning; Mathematics - Statistics Theory; Statistics - Machine Learning; Statistics - Theory
creator Lin, Licong; Wu, Jingfeng; Kakade, Sham M; Bartlett, Peter L; Lee, Jason D
description Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite dimensional linear regression setup. Specifically, we consider a model with $M$ parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using $N$ data. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree $a>1$, we show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$. The variance error, which increases with $M$, is dominated by the other errors due to the implicit regularization of SGD, thus disappearing from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
doi_str_mv 10.48550/arxiv.2406.08466
format Article
identifier DOI: 10.48550/arxiv.2406.08466
language eng
recordid cdi_arxiv_primary_2406_08466
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Learning
Mathematics - Statistics Theory
Statistics - Machine Learning
Statistics - Theory
title Scaling Laws in Linear Regression: Compute, Parameters, and Data
url https://arxiv.org/abs/2406.08466