Effects of Scale on Language Model Robustness
Main authors: | , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Language models exhibit scaling laws, whereby increasing model and dataset size yields predictable decreases in negative log likelihood, unlocking a dazzling array of capabilities. This phenomenon spurs many companies to train ever larger models in pursuit of ever improved performance. Yet, these models are vulnerable to adversarial inputs such as "jailbreaks" and prompt injections that induce models to perform undesired behaviors, posing a growing risk as models become more capable. Prior work indicates that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale? We study this question empirically in the classification setting, finding that without explicit defense training, larger models tend to be modestly more robust on most tasks, though the effect is not reliable. Even with the advantage conferred by scale, undefended models remain easy to attack in absolute terms, and we thus turn our attention to explicitly training models for adversarial robustness, which we show to be a much more compute-efficient defense than scaling model size alone. In this setting, we also observe that adversarially trained larger models generalize faster and better to modified attacks not seen during training when compared with smaller models. Finally, we analyze the offense/defense balance of increasing compute, finding parity in some settings and an advantage for offense in others, suggesting that adversarial training alone is not sufficient to solve robustness, even at greater model scales. |
DOI: | 10.48550/arxiv.2407.18213 |
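
The abstract's opening sentence refers to scaling laws relating loss to model and dataset size. As a point of reference (not stated in this record), a commonly used parametric form, e.g. the one fitted by Hoffmann et al. (2022), expresses the expected negative log likelihood L in terms of parameter count N and training tokens D:

```latex
% Illustrative scaling-law form; E, A, B, \alpha, \beta are empirically fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```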