Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate
Format: Article
Language: English
Abstract: Stochastic Gradient Descent (SGD) is a central tool in machine learning. We
prove that SGD converges to zero loss, even with a fixed (non-vanishing)
learning rate - in the special case of homogeneous linear classifiers with
smooth monotone loss functions, optimized on linearly separable data. Previous
works assumed either a vanishing learning rate, iterate averaging, or loss
assumptions that do not hold for monotone loss functions used for
classification, such as the logistic loss. We prove our result on a fixed
dataset, both for sampling with or without replacement. Furthermore, for
logistic loss (and similar exponentially-tailed losses), we prove that with SGD
the weight vector converges in direction to the $L_2$ max margin vector as
$O(1/\log(t))$ for almost all separable datasets, and the loss converges as
$O(1/t)$ - similarly to gradient descent. Lastly, we examine the case of a
fixed learning rate proportional to the minibatch size. We prove that in this
case, the asymptotic convergence rate of SGD (with replacement) does not depend
on the minibatch size in terms of epochs, if the support vectors span the data.
These results may suggest an explanation for similar behaviors observed in deep
networks trained with SGD.
DOI: 10.48550/arxiv.1806.01796
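
To make the abstract's main claim concrete, here is a minimal illustrative sketch (not code from the paper or its authors): fixed-step-size SGD with the logistic loss on a synthetic, linearly separable 2-D dataset, printing the training loss and the cosine of the angle between the normalized weight vector and an estimate of the $L_2$ max-margin direction. The dataset, the step size `eta`, the iteration counts, and the helper `max_margin_direction` are all assumptions made for this demo.

```python
# Minimal sketch (illustrative only, not the paper's code): fixed-learning-rate
# SGD with logistic loss on synthetic, linearly separable 2-D data.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic separable data. Labels are folded into the examples, so a
# homogeneous linear classifier w separates the data iff x_i @ w > 0 for all i.
n, d = 100, 2
X = rng.normal(size=(n, d))
X[:, 0] = np.abs(X[:, 0]) + 0.5      # every point has x_1 >= 0.5, so w = (1, 0) separates

def logistic_loss(w):
    return np.mean(np.log1p(np.exp(-X @ w)))

def max_margin_direction(steps=200_000, lr=1.0):
    # Rough estimate of the L2 max-margin direction: run full-batch gradient
    # descent on the logistic loss and normalize; on separable data its
    # iterates are known to converge in direction to the max-margin vector.
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(X @ w))          # sigmoid(-x_i @ w)
        w += lr * (X * p[:, None]).mean(axis=0)  # minus the loss gradient
    return w / np.linalg.norm(w)

w_hat = max_margin_direction()

# SGD with a fixed (non-vanishing) learning rate, sampling with replacement,
# minibatch size 1. The step size below is an assumed illustrative value.
eta = 0.1
w = np.zeros(d)
for t in range(1, 200_001):
    i = rng.integers(n)
    w += eta * X[i] / (1.0 + np.exp(X[i] @ w))   # -grad of log(1 + exp(-x_i @ w))
    if t % 50_000 == 0:
        cos = (w @ w_hat) / np.linalg.norm(w)
        print(f"t={t:>6d}  loss={logistic_loss(w):.3e}  cos(w, max-margin)={cos:.5f}")
```

Under these assumptions one should see the loss shrink roughly like $1/t$ while the cosine approaches 1 only logarithmically slowly, consistent with the $O(1/t)$ loss rate and $O(1/\log(t))$ directional rate stated in the abstract.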