An Empirical Study on the Transferability of Transformer Modules in Parameter-Efficient Fine-Tuning
Saved in:
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: Parameter-efficient fine-tuning approaches have recently garnered a lot of
attention. With a considerably lower number of trainable weights, these methods can
bring about scalability and computational effectiveness. In this paper, we look for
optimal sub-networks and investigate the capability of different transformer modules
in transferring knowledge from a pre-trained model to a downstream task. Our
empirical results suggest that every transformer module in BERT can act as a winning
ticket: fine-tuning each specific module while keeping the rest of the network frozen
can lead to performance comparable to full fine-tuning. Among the different modules,
LayerNorms exhibit the best capacity for knowledge transfer with limited trainable
weights, to the extent that, with only 0.003% of all parameters in the layer-wise
analysis, they show acceptable performance on various target tasks. As for the
reasons behind their effectiveness, we argue that their notable performance could be
attributed to their high-magnitude weights compared to those of the other modules in
pre-trained BERT.
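The module-wise setup described in the abstract (fine-tuning one module type, such as LayerNorm, while the rest of BERT stays frozen) can be illustrated with a few lines of PyTorch. The sketch below is not the authors' code; it assumes the Hugging Face `transformers` library, the `bert-base-uncased` checkpoint, and a hypothetical 2-label classification head, and it additionally prints mean absolute weight magnitudes as a rough check of the magnitude argument made above.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load a pre-trained BERT with a fresh classification head (hypothetical 2-label task).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Module-wise fine-tuning: freeze every parameter, then leave only the LayerNorm
# parameters (and the task head, so the task can be learned at all) trainable.
for name, param in model.named_parameters():
    param.requires_grad = ("LayerNorm" in name) or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total} ({100 * trainable / total:.3f}%)")

# Rough sanity check of the magnitude argument: compare the mean absolute value of
# LayerNorm weights against that of the remaining pre-trained encoder weights.
ln_vals, other_vals = [], []
for name, param in model.bert.named_parameters():
    (ln_vals if "LayerNorm" in name else other_vals).append(param.detach().abs().flatten())
print("mean |w| LayerNorm:", torch.cat(ln_vals).mean().item())
print("mean |w| other    :", torch.cat(other_vals).mean().item())
```

The partially frozen model can then be passed to any standard training loop or to the `transformers` Trainer; only the unfrozen parameters receive gradient updates.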
DOI: 10.48550/arxiv.2302.00378