DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs
Saved in:

Main author: , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks. However, serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns of different LLMs. To bridge this gap, we present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by up to 10x while maintaining high model quality. The key insight behind this design is that fine-tuning results in small-magnitude changes to the pre-trained model. By co-designing the serving system with the compression algorithm, DeltaZip achieves a 2x to 12x throughput improvement compared to state-of-the-art systems.
DOI: 10.48550/arxiv.2312.05215
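
The record above carries only the abstract, but the core idea it describes (storing each fine-tuned model as the shared base model plus a heavily compressed weight delta) can be illustrated concretely. The sketch below is a minimal, hypothetical NumPy illustration, not DeltaZip's actual compression algorithm: it uses plain magnitude pruning and uniform quantization as stand-ins for the paper's co-designed scheme, and all function names, thresholds, and shapes are assumptions for demonstration only.

```python
# Illustrative sketch only: NOT DeltaZip's algorithm. Magnitude pruning and
# uniform quantization stand in for its actual delta-compression pipeline.
import numpy as np

def compress_delta(base: np.ndarray, finetuned: np.ndarray,
                   sparsity: float = 0.9, bits: int = 4):
    """Compress the weight delta between a fine-tuned and a base matrix."""
    delta = finetuned - base  # small-magnitude by the abstract's key insight
    # Keep only the largest-magnitude entries (simple magnitude pruning).
    k = max(1, int(delta.size * (1.0 - sparsity)))
    idx = np.argpartition(np.abs(delta).ravel(), -k)[-k:]
    values = delta.ravel()[idx]
    # Uniform symmetric quantization of the surviving entries.
    max_abs = float(np.abs(values).max())
    scale = max_abs / (2 ** (bits - 1) - 1) if max_abs > 0 else 1.0
    quantized = np.round(values / scale).astype(np.int8)
    return idx.astype(np.int32), quantized, scale, delta.shape

def decompress_delta(idx, quantized, scale, shape, base):
    """Approximate the fine-tuned matrix as base + decompressed delta."""
    delta = np.zeros(base.size, dtype=base.dtype)
    delta[idx] = quantized.astype(base.dtype) * scale
    return base + delta.reshape(shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.standard_normal((256, 256)).astype(np.float32)
    # Simulate a fine-tune that only slightly perturbs the base weights.
    finetuned = base + 0.01 * rng.standard_normal((256, 256)).astype(np.float32)
    idx, quantized, scale, shape = compress_delta(base, finetuned)
    approx = decompress_delta(idx, quantized, scale, shape, base)
    print("max reconstruction error:", float(np.abs(approx - finetuned).max()))
```

Because the fine-tuning delta is small in magnitude, even aggressive pruning and low-bit quantization of the delta leave the reconstructed weights close to the fine-tuned model, which is the property the abstract's compression and serving-throughput claims rely on.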