Sparse is Enough in Fine-tuning Pre-trained Large Language Models
With the prevalence of pre-training-fine-tuning paradigm, how to efficiently adapt the pre-trained model to the downstream tasks has been an intriguing issue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated effectiveness and b...
Gespeichert in:
Hauptverfasser: | , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | With the prevalence of pre-training-fine-tuning paradigm, how to efficiently
adapt the pre-trained model to the downstream tasks has been an intriguing
issue. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for
low-cost adaptation. Although PEFT has demonstrated effectiveness and been
widely applied, the underlying principles are still unclear. In this paper, we
adopt the PAC-Bayesian generalization error bound, viewing pre-training as a
shift of prior distribution which leads to a tighter bound for generalization
error. We validate this shift from the perspectives of oscillations in the loss
landscape and the quasi-sparsity in gradient distribution. Based on this, we
propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment
Fine-Tuning (SIFT), and validate its effectiveness on a range of tasks
including the GLUE Benchmark and Instruction-tuning. The code is accessible at
https://github.com/song-wx/SIFT/. |
---|---|
DOI: | 10.48550/arxiv.2312.11875 |