Scaling Laws for Autoregressive Generative Modeling

We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information-theoretic interpretation as $S($True$) + D_{\mathrm{KL}}($True$||$Model$)$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an $8\times 8$ resolution, and we can forecast the model size needed to achieve any given reducible loss (i.e. $D_{\mathrm{KL}}$) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
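
The central quantitative claim above is the "power-law plus constant" form of the fits. As a purely illustrative sketch (not the authors' code; the function name, synthetic data, and parameter values below are all invented for demonstration), the snippet fits a curve of the form $L(C) = L_\infty + (C_0/C)^{\alpha_C}$, where the fitted constant plays the role of the irreducible entropy $S($True$)$ and the power-law term the reducible $D_{\mathrm{KL}}$:

```python
# Illustrative sketch only: fit a "power-law plus constant" scaling curve
# L(C) = L_inf + (C0 / C)**alpha, as described in the abstract.
# All data below are synthetic; L_inf stands in for the irreducible entropy
# and the power-law term for the reducible KL divergence.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(c, l_inf, c0, alpha):
    # Loss as a function of compute budget c.
    return l_inf + (c0 / c) ** alpha

# Synthetic (compute, loss) points with mild multiplicative noise.
rng = np.random.default_rng(0)
compute = np.logspace(0, 6, num=20)          # arbitrary compute units
loss = scaling_law(compute, 2.0, 10.0, 0.3)
loss *= 1.0 + 0.01 * rng.standard_normal(compute.shape)

# Fit the three parameters, constrained to be positive.
(l_inf, c0, alpha), _ = curve_fit(
    scaling_law, compute, loss, p0=[1.0, 1.0, 0.5], bounds=(0.0, np.inf)
)
print(f"irreducible loss ~ {l_inf:.3f} nats, C0 ~ {c0:.3f}, exponent ~ {alpha:.3f}")
```

In the same spirit, the compute-optimal model size claim corresponds to a relation of the form $N_{\mathrm{opt}} \propto C^{\beta}$, with an exponent the abstract describes as nearly universal across data domains.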

Bibliographic Details
Main authors: Henighan, Tom; Kaplan, Jared; Katz, Mor; Chen, Mark; Hesse, Christopher; Jackson, Jacob; Jun, Heewoo; Brown, Tom B; Dhariwal, Prafulla; Gray, Scott; Hallacy, Chris; Mann, Benjamin; Radford, Alec; Ramesh, Aditya; Ryder, Nick; Ziegler, Daniel M; Schulman, John; Amodei, Dario; McCandlish, Sam
Format: Article
Language: English
Date: 2020-10-27
DOI: 10.48550/arXiv.2010.14701
Subjects: Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Source: arXiv.org
Online access: https://arxiv.org/abs/2010.14701