NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models
Format: Article
Language: English
Online access: Order full text
Abstract: Structured pruning methods have proven effective in reducing model size and accelerating inference speed in various network architectures such as Transformers. Despite the versatility of encoder-decoder models in numerous NLP tasks, structured pruning methods for such models are relatively less explored compared to encoder-only models. In this study, we investigate the behavior of structured pruning of encoder-decoder models from a decoupled perspective, pruning the encoder and decoder components separately. Our findings highlight two insights: (1) the number of decoder layers is the dominant factor for inference speed, and (2) low sparsity in the pruned encoder network enhances generation quality. Motivated by these findings, we propose a simple and effective framework, NASH, which narrows the encoder and shortens the decoder of encoder-decoder models. Extensive experiments on diverse generation and inference tasks validate the effectiveness of our method in both speedup and output quality.
DOI: 10.48550/arxiv.2310.10054
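
To make the abstract's "shorten the decoder" idea concrete, below is a minimal, hypothetical sketch, not the authors' released implementation, that keeps a uniformly spaced subset of decoder blocks in a Hugging Face T5 model. The helper name `shorten_decoder` and the layer-selection heuristic are assumptions for illustration; NASH additionally narrows the encoder via structured width pruning (attention heads and feed-forward dimensions), which this sketch omits.

```python
import torch
from transformers import T5ForConditionalGeneration

def shorten_decoder(model: T5ForConditionalGeneration, num_kept_layers: int):
    """Keep a uniformly spaced subset of decoder blocks (illustrative heuristic).

    Assumes num_kept_layers >= 2; this is a sketch, not the NASH procedure.
    """
    blocks = model.decoder.block  # nn.ModuleList of T5 decoder blocks
    total = len(blocks)
    # Uniformly spaced indices across depth; index 0 is always kept because
    # the first T5 block holds the relative position bias.
    keep = [round(i * (total - 1) / (num_kept_layers - 1)) for i in range(num_kept_layers)]
    model.decoder.block = torch.nn.ModuleList(blocks[i] for i in keep)
    model.config.num_decoder_layers = num_kept_layers
    return model

if __name__ == "__main__":
    model = T5ForConditionalGeneration.from_pretrained("t5-base")  # 12 decoder layers
    model = shorten_decoder(model, num_kept_layers=3)              # shallow decoder
    print(sum(p.numel() for p in model.parameters()))              # fewer parameters
```

In practice the shortened model would then be fine-tuned (or distilled) on the target task to recover quality, while the encoder keeps its full depth, consistent with the finding that low encoder sparsity helps generation quality.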