CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of LLMs
Format: Article
Language: English
Abstract: Powerful large language models (LLMs) are increasingly expected to be deployed with lower computational costs, enabling their capabilities on resource-constrained devices. Post-training quantization (PTQ) has emerged as a leading approach to this end, with the best methods compressing weights to less than 2 bits on average. In this paper, we propose Channel-Relaxed Vector Quantization (CRVQ), a novel technique that significantly improves the performance of PTQ baselines at the cost of only a minimal number of additional bits. This state-of-the-art extreme compression method achieves its results through two key innovations: (1) carefully selecting and reordering a very small subset of critical weight channels, and (2) leveraging multiple codebooks to relax the constraint on those critical channels. With our method, we demonstrate a 38.9% improvement over the strongest current sub-2-bit PTQ baseline, bringing 1-bit compression closer to lossless. Furthermore, our approach offers flexible customization of quantization bit-width and performance, providing a wider range of deployment options for diverse hardware platforms.
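The abstract only sketches the two ideas at a high level, so the snippet below is a minimal, assumption-laden illustration rather than the paper's actual algorithm: it ranks weight channels by a simple L2-norm importance proxy, reorders them to set a small critical subset aside, and quantizes that subset with an extra residual codebook while the remaining channels share a single coarse codebook. The function names (`kmeans_codebook`, `crvq_style_quantize`), the importance metric, the sub-vector group size, and the codebook sizes are all placeholders chosen for readability, not values taken from the paper.

```python
# Illustrative sketch only: critical-channel selection + multi-codebook VQ.
# All metrics, names, and sizes here are assumptions, not CRVQ's real settings.
import numpy as np

def kmeans_codebook(vectors, k, iters=25, seed=0):
    """Plain k-means returning a (k, d) codebook and per-vector assignments."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(k):
            members = vectors[assign == c]
            if len(members):
                codebook[c] = members.mean(0)
    return codebook, assign

def crvq_style_quantize(W, n_critical=8, group=4, k_base=16, k_extra=16):
    """Vector-quantize rows of W (channels x features), giving the most
    'important' channels a second, residual codebook."""
    # 1) Rank channels by a crude importance proxy and reorder (assumption).
    importance = np.linalg.norm(W, axis=1)
    order = np.argsort(-importance)
    critical, regular = order[:n_critical], order[n_critical:]

    W_hat = np.empty_like(W)

    def vq(rows, codebook_sizes):
        vecs = rows.reshape(-1, group)      # split each row into sub-vectors
        recon = np.zeros_like(vecs)
        residual = vecs
        for k in codebook_sizes:            # one or more residual codebooks
            cb, idx = kmeans_codebook(residual, k)
            recon = recon + cb[idx]
            residual = vecs - recon
        return recon.reshape(rows.shape)

    # 2) Regular channels: a single coarse codebook.
    W_hat[regular] = vq(W[regular], [k_base])
    # 3) Critical channels: base codebook plus an extra residual codebook,
    #    i.e. their per-channel constraint is "relaxed" with a few more bits.
    W_hat[critical] = vq(W[critical], [k_base, k_extra])
    return W_hat

# Toy usage: the extra codebook only applies to a small fraction of channels,
# so the added bit cost stays small while reconstruction of those channels improves.
W = np.random.randn(256, 64).astype(np.float32)
W_hat = crvq_style_quantize(W)
print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```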
DOI: 10.48550/arxiv.2412.09282