Characterizing the Accuracy -- Efficiency Trade-off of Low-rank Decomposition in Language Models
Format: | Article |
---|---|
Language: | English |
Abstract: | Recent large language models (LLMs) employ billions of parameters to enable
broad problem-solving capabilities. Such language models also tend to be
memory-bound because of the dominance of matrix-vector and matrix-matrix
multiplications with low arithmetic intensity. Therefore, optimizing the memory
footprint and traffic is an important optimization direction for LLMs today.
Model compression methods such as quantization and parameter pruning have been
actively explored to achieve memory footprint and traffic optimization.
However, the accuracy-efficiency trade-off of rank pruning (i.e., low-rank
decomposition) for LLMs is not yet well understood. Therefore, in this work, we
characterize the accuracy-efficiency trade-off of a low-rank decomposition
method, specifically Tucker decomposition, on recent language models, including
an open-source LLM, Llama 2. We formalize the low-rank decomposition design
space and show that the decomposition design space is enormous (e.g.,
O($2^{39}$) for Llama2-7B). To navigate this vast design space, we perform
thorough case studies of accuracy-efficiency trade-offs using six widely used
LLM benchmarks on BERT and Llama 2 models. Our results show
that we can achieve a 9\% model size reduction with minimal accuracy drops of
4\%p to 10\%p, depending on the difficulty of the benchmark (\%p denotes
"percentage point," the absolute difference between two percentages; e.g.,
74\% -> 78\% is a 4\%p increase), without any retraining to recover accuracy
after decomposition. The results show that low-rank decomposition can be a
promising direction for LLM-based applications that require real-time service
at scale (e.g., AI agents and real-time coding assistants), where latency is as
important as model accuracy. |
---|---|
DOI: | 10.48550/arxiv.2405.06626 |
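
The abstract does not include code; as a hedged illustration of the general idea, the sketch below factors one Linear layer into two low-rank layers using truncated SVD in PyTorch. For a 2-D weight matrix, Tucker decomposition reduces to such a matrix factorization, so SVD is used here as a stand-in; the function name `low_rank_factorize`, the rank, and the layer sizes are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: factor a dense nn.Linear weight into two smaller
# Linear layers via truncated SVD (a rank-r approximation W ~= U_r @ V_r).
# The function name, rank, and layer sizes below are assumptions, not values
# from the paper.
import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    W = linear.weight.data                       # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                 # (out_features, rank)
    V_r = Vh[:rank, :]                           # (rank, in_features)

    down = nn.Linear(linear.in_features, rank, bias=False)
    up = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    down.weight.data.copy_(V_r)
    up.weight.data.copy_(U_r)
    if linear.bias is not None:
        up.bias.data.copy_(linear.bias.data)
    return nn.Sequential(down, up)               # y = U_r (V_r x) + bias

# Parameter count drops from out*in to rank*(out+in): a 4096x4096 projection
# kept at rank 512 retains roughly 25% of its original weights.
layer = nn.Linear(4096, 4096)
compressed = low_rank_factorize(layer, rank=512)
x = torch.randn(2, 4096)
print((layer(x) - compressed(x)).abs().max())    # approximation error
```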