SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching
Format: Article
Language: English
Abstract: Compressive adaptation approaches, such as QLoRA, are widely popular
alternatives for reducing memory requirements during fine-tuning of large
language models (LLMs) while producing models capable of handling various
downstream tasks. The key idea is to employ a "two-tower" architecture:
compressing pre-trained LLM parameters into compact representations and
fine-tuning an additive full-precision adapter, which typically has few
tunable parameters in low-rank format. However, strict algebraic
assumptions, such as the low-rank assumption, and the complexity of composing
two-tower architectures are known shortcomings, resulting in a poor
accuracy-efficiency trade-off. In response to these limitations, we
propose SpaLLM (Sketched Parameter Adaptation of LLMs), a novel compressive
adaptation approach for LLMs. This method is also the first to demonstrate
parameter-sharing compression for LLM fine-tuning, which, unlike QLoRA,
is free from strict low-rank algebraic assumptions on adapters. Furthermore,
our proposal unifies model compression and adaptation into a single,
streamlined process, eliminating the need for two-tower architectures. SpaLLM
sketches pre-trained LLM weights into lookup tables and directly fine-tunes the
values in these tables. This approach simplifies the compressive adaptation
workflow of LLMs, potentially improves multi-user serving efficiency, and
delivers significantly better accuracy on both natural language understanding
and generation tasks. Moreover, by avoiding the "two-tower" architecture, our
framework requires only one compressed matrix multiplication per layer during
inference, demonstrating superior inference efficiency compared to previous
methods.
DOI: 10.48550/arxiv.2410.06364
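
To make the lookup-table idea in the abstract concrete, below is a minimal, illustrative PyTorch sketch of a parameter-sharing ("sketched") linear layer: the frozen pre-trained weight matrix is replaced by a fixed index grid that points into a small table of shared values, only the table entries are fine-tuned, and inference uses a single matrix multiplication per layer with no separate adapter branch. The class name `SketchedLinear`, the random-hash index assignment, and the table size are assumptions made for illustration, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class SketchedLinear(nn.Module):
    """Illustrative parameter-sharing ("sketched") linear layer.

    A fixed index grid maps every weight entry to a slot in a small table of
    shared values; only the table is trainable. This is a sketch under the
    assumptions stated above, not the paper's exact construction.
    """

    def __init__(self, weight: torch.Tensor, table_size: int = 4096):
        super().__init__()
        weight = weight.detach().float()
        out_features, in_features = weight.shape
        # Fixed (non-trainable) assignment of each weight entry to a table slot.
        # A real sketch would pick indices to minimize compression error; a
        # random hash is used here purely for illustration.
        indices = torch.randint(0, table_size, (out_features, in_features))
        self.register_buffer("indices", indices)
        # Initialize each shared value to the mean of the weights mapped to it,
        # so the compressed layer starts close to the pre-trained one.
        sums = torch.zeros(table_size)
        counts = torch.zeros(table_size)
        sums.scatter_add_(0, indices.flatten(), weight.flatten())
        counts.scatter_add_(0, indices.flatten(), torch.ones(weight.numel()))
        # The lookup table is the only trainable tensor in the layer.
        self.table = nn.Parameter(sums / counts.clamp(min=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reconstruct the weight by table lookup, then a single matmul --
        # no additive low-rank adapter branch as in two-tower methods.
        w = self.table[self.indices]
        return x @ w.t()
```

In a hypothetical usage, each pre-trained `nn.Linear` weight in the model would be wrapped in such a layer, and fine-tuning would optimize only the table parameters while the index grid stays frozen.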