Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous works assumed either a vanishing learning rate, iterate averaging, or loss assumptions that do not hold for monotone loss functions used for classification, such as the logistic loss. We prove our result on a fixed dataset, for sampling both with and without replacement. Furthermore, for the logistic loss (and similar exponentially-tailed losses), we prove that with SGD the weight vector converges in direction to the $L_2$ max-margin vector as $O(1/\log(t))$ for almost all separable datasets, and the loss converges as $O(1/t)$, similarly to gradient descent. Lastly, we examine the case of a fixed learning rate proportional to the minibatch size. We prove that in this case, the asymptotic convergence rate of SGD (with replacement) does not depend on the minibatch size in terms of epochs, if the support vectors span the data. These results may suggest an explanation for similar behaviors observed in deep networks trained with SGD.

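As a numerical companion to the abstract, the sketch below (not code from the paper; the toy dataset, the step size eta = 0.2, and all variable names are illustrative choices) runs fixed-step SGD with the logistic loss on a small linearly separable problem, sampling with replacement. Under the paper's assumptions, the printed loss should decay roughly like $O(1/t)$ while the normalized weight vector slowly settles toward the $L_2$ max-margin direction.

```python
# Illustrative sketch only (not the paper's code): fixed-step SGD with the
# logistic loss on a toy linearly separable dataset, sampling with replacement.
import numpy as np

rng = np.random.default_rng(0)

# Toy data with labels folded in (x_n <- y_n * x_n), so separability means
# there exists some w with w @ x_n > 0 for every n.
pos = rng.normal(loc=[2.0, 1.0], scale=0.3, size=(20, 2))    # class y = +1
neg = rng.normal(loc=[-1.0, -2.0], scale=0.3, size=(20, 2))  # class y = -1
X = np.vstack([pos, -neg])                                   # label-folded samples

eta = 0.2        # fixed (non-vanishing) learning rate, assumed small enough
w = np.zeros(2)  # homogeneous linear classifier: predict sign(w @ x)

def full_logistic_loss(w):
    """Average logistic loss over the (label-folded) dataset."""
    return np.mean(np.log1p(np.exp(-X @ w)))

for t in range(1, 100_001):
    x = X[rng.integers(len(X))]            # sample with replacement
    w += eta * x / (1.0 + np.exp(x @ w))   # SGD step on log(1 + exp(-w @ x))
    if t in (10, 100, 1_000, 10_000, 100_000):
        direction = w / np.linalg.norm(w)
        print(f"t={t:>6d}  loss={full_logistic_loss(w):.3e}  direction={direction}")
```

Comparing successive printed directions gives a feel for the slow $O(1/\log(t))$ directional convergence, while the loss column shrinks much faster, consistent with the $O(1/t)$ rate.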

Bibliographic Details
Main authors: Nacson, Mor Shpigel; Srebro, Nathan; Soudry, Daniel
Format: Article
Language: English
Subjects: Computer Science - Learning; Statistics - Machine Learning
DOI: 10.48550/arxiv.1806.01796
Date: 2018-06-05
Source: arXiv.org