On the Power of Differentiable Learning versus PAC and SQ Learning
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | We study the power of learning via mini-batch stochastic gradient descent
(SGD) on the population loss, and batch Gradient Descent (GD) on the empirical
loss, of a differentiable model or neural network, and ask what learning
problems can be learnt using these paradigms. We show that SGD and GD can
always simulate learning with statistical queries (SQ), but their ability to go
beyond that depends on the precision $\rho$ of the gradient calculations
relative to the minibatch size $b$ (for SGD) and sample size $m$ (for GD). With
fine enough precision relative to minibatch size, namely when $b \rho$ is small
enough, SGD can go beyond SQ learning and simulate any sample-based learning
algorithm and thus its learning power is equivalent to that of PAC learning;
this extends prior work that achieved this result for $b=1$. Similarly, with
fine enough precision relative to the sample size $m$, GD can also simulate any
sample-based learning algorithm based on $m$ samples. In particular, with
polynomially many bits of precision (i.e. when $\rho$ is exponentially small),
SGD and GD can both simulate PAC learning regardless of the mini-batch size. On
the other hand, when $b \rho^2$ is large enough, the power of SGD is equivalent
to that of SQ learning. |
---|---|
DOI: | 10.48550/arxiv.2108.04190 |
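
The summary's two parameters, minibatch size $b$ and gradient precision $\rho$, can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the record does not say how precision is modelled, so the sketch rounds each averaged gradient to the nearest multiple of `rho`; the one-parameter model, the function names, and the use of a small finite sample in place of the population loss are hypothetical. Because the summary does not quantify "small enough" or "large enough", `regime_quantities` only reports $b\rho$ and $b\rho^2$ and does not classify a regime.

```python
import random


def round_to_precision(value, rho):
    """Round value to the nearest multiple of rho.

    Assumption: rounding is just one simple way to model a gradient
    that is only known up to precision rho.
    """
    return rho * round(value / rho)


def minibatch_sgd(samples, b, rho, lr=0.1, steps=200, seed=0):
    """Fit y ~ w * x with mini-batch SGD on the squared loss.

    Each step draws b samples with replacement, averages the gradient of
    0.5 * (w * x - y) ** 2 with respect to w, rounds that average to
    precision rho, and takes a gradient step.
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = [rng.choice(samples) for _ in range(b)]
        grad = sum((w * x - y) * x for x, y in batch) / b
        w -= lr * round_to_precision(grad, rho)
    return w


def regime_quantities(b, rho):
    """Return the two products the summary refers to: b*rho and b*rho^2."""
    return {"b_rho": b * rho, "b_rho_sq": b * rho ** 2}


if __name__ == "__main__":
    data = [(1.0, 2.0), (-1.0, -2.0)]  # toy data generated by y = 2 * x
    print(minibatch_sgd(data, b=4, rho=1e-6))   # fine precision
    print(minibatch_sgd(data, b=4, rho=10.0))   # coarse precision
    print(regime_quantities(b=4, rho=1e-6))
```

On this toy data the coarse setting rounds every gradient to zero, so the parameter never moves, while the fine setting lets the updates carry detailed information about the batch. This only illustrates why precision matters; it is not the simulation argument of the article.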