Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous works assumed either a vanishing learning rate, iterate averaging, or loss assumptions that do not hold for monotone loss functions used for classification, such as the logistic loss. We prove our result on a fixed dataset, for sampling both with and without replacement. Furthermore, for the logistic loss (and similar exponentially-tailed losses), we prove that with SGD the weight vector converges in direction to the $L_2$ max-margin vector as $O(1/\log(t))$ for almost all separable datasets, and the loss converges as $O(1/t)$, similarly to gradient descent. Lastly, we examine the case of a fixed learning rate proportional to the minibatch size. We prove that in this case, the asymptotic convergence rate of SGD (with replacement) does not depend on the minibatch size in terms of epochs, if the support vectors span the data. These results may suggest an explanation for similar behaviors observed in deep networks trained with SGD.

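As a numerical companion to the abstract, the sketch below (not code from the paper; the toy dataset, the step size eta = 0.2, and all variable names are illustrative choices) runs fixed-step SGD with the logistic loss on a small linearly separable problem, sampling with replacement. Under the paper's assumptions, the printed loss should decay roughly like $O(1/t)$ while the normalized weight vector slowly settles toward the $L_2$ max-margin direction.

```python
# Illustrative sketch only (not the paper's code): fixed-step SGD with the
# logistic loss on a toy linearly separable dataset, sampling with replacement.
import numpy as np

rng = np.random.default_rng(0)

# Toy data with labels folded in (x_n <- y_n * x_n), so separability means
# there exists some w with w @ x_n > 0 for every n.
pos = rng.normal(loc=[2.0, 1.0], scale=0.3, size=(20, 2))    # class y = +1
neg = rng.normal(loc=[-1.0, -2.0], scale=0.3, size=(20, 2))  # class y = -1
X = np.vstack([pos, -neg])                                   # label-folded samples

eta = 0.2        # fixed (non-vanishing) learning rate, assumed small enough
w = np.zeros(2)  # homogeneous linear classifier: predict sign(w @ x)

def full_logistic_loss(w):
    """Average logistic loss over the (label-folded) dataset."""
    return np.mean(np.log1p(np.exp(-X @ w)))

for t in range(1, 100_001):
    x = X[rng.integers(len(X))]            # sample with replacement
    w += eta * x / (1.0 + np.exp(x @ w))   # SGD step on log(1 + exp(-w @ x))
    if t in (10, 100, 1_000, 10_000, 100_000):
        direction = w / np.linalg.norm(w)
        print(f"t={t:>6d}  loss={full_logistic_loss(w):.3e}  direction={direction}")
```

Comparing successive printed directions gives a feel for the slow $O(1/\log(t))$ directional convergence, while the loss column shrinks much faster, consistent with the $O(1/t)$ rate.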

Bibliographic Details
Main authors: Nacson, Mor Shpigel; Srebro, Nathan; Soudry, Daniel
Format: Article
Language: English
Subjects: Computer Science - Learning; Statistics - Machine Learning
DOI: 10.48550/arxiv.1806.01796
Date: 2018-06-05
Source: arXiv.org