Differentiable Slimming for Memory-Efficient Transformers

Bibliographic Details
Published in: IEEE Embedded Systems Letters, 2023-12, Vol. 15 (4), pp. 186-189
Authors: Penkov, Nikolay; Balaskas, Konstantinos; Rapp, Martin; Henkel, Joerg
Format: Article
Language: English
Description
Abstract: Transformer models are continuously achieving state-of-the-art performance on a wide range of benchmarks. To meet demanding performance targets, the number of model parameters is continuously increased. As a result, state-of-the-art Transformers require substantial computational resources, prohibiting their deployment on consumer-grade hardware. In the literature, overparameterized Transformers are successfully reduced in size with the help of pruning strategies. Existing works lack the ability to optimize the full architecture in a fully differentiable manner without incurring significant overheads. Our work proposes a single-stage approach for training a Transformer for memory-efficient inference under various resource-constrained scenarios. Transformer blocks are extended with trainable gate parameters, which attribute importance and control information flow. Their integration into a differentiable pruning-aware training scheme allows the extraction of extremely sparse subnetworks at runtime, with minimal performance degradation. Pruning results at the attention-head and layer levels illustrate the memory efficiency of our trained subnetworks under various memory budgets.
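
The head-level gating described in the abstract can be pictured with a short sketch. The following PyTorch code is an illustrative assumption, not the authors' implementation: the class and method names (GatedMultiHeadSelfAttention, gate_sparsity_loss) are hypothetical. Each attention head's output is scaled by a trainable gate before the heads are mixed by the output projection, and an L1 penalty on the gates acts as an auxiliary loss that drives unimportant heads toward zero so they can later be pruned.

import torch
import torch.nn as nn


class GatedMultiHeadSelfAttention(nn.Module):
    """Self-attention with one trainable gate per head (illustrative sketch only)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # One trainable gate per attention head, initialized fully open.
        self.head_gates = nn.Parameter(torch.ones(num_heads))

    def forward(self, x):                                   # x: (batch, seq, embed_dim)
        b, t, e = x.shape
        qkv = self.qkv(x).view(b, t, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)                          # each (b, t, heads, head_dim)
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))     # (b, heads, t, head_dim)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v                       # (b, heads, t, head_dim)

        # Scale each head's contribution by its gate before heads are mixed.
        out = out * self.head_gates.view(1, -1, 1, 1)
        out = out.transpose(1, 2).reshape(b, t, e)
        return self.out_proj(out)

    def gate_sparsity_loss(self):
        # Auxiliary L1 penalty that pushes unimportant gates toward zero,
        # so the corresponding heads can be dropped at inference time.
        return self.head_gates.abs().sum()

In such a scheme, the penalty would be added to the task loss with a weighting factor during training; afterwards, heads (or whole layers, with an analogous per-layer gate) whose gates fall below a threshold can be removed to meet a given memory budget.
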
ISSN: 1943-0663, 1943-0671
DOI: 10.1109/LES.2023.3299638