VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storag...

Bibliographic Details
Published in:	arXiv.org 2024-10
Main Authors:	Liu, Yifei; Wen, Jicheng; Wang, Yang; Ye, Shengyu; Li Lyna Zhang; Cao, Ting; Cheng, Li; Mao, Yang
Format:	Article
Language:	eng
Subjects:	Accuracy; Algorithms; Design optimization; Inference; Large language models; Lookup tables; Optimization; Redundancy
Online Access:	Full text