CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference
Format: Article
Language: English
Online access: Request full text
Abstract: Pre-trained large language models (LLMs) often need specialization for domain-specific tasks. Low-Rank Adaptation (LoRA) is a popular approach that adapts a base model to multiple tasks by adding lightweight trainable adapters. In this paper, we present CaraServe, a system that efficiently serves many LoRA adapters derived from a common base model. CaraServe maintains the base model on GPUs and dynamically loads activated LoRA adapters from main memory. As GPU loading results in a cold start that substantially delays token generation, CaraServe employs a CPU-assisted approach: it starts the activated adapters early on CPUs for prefilling while they are being loaded onto GPUs; after loading completes, it switches to the GPUs for generative LoRA inference. CaraServe develops a highly optimized synchronization mechanism to efficiently coordinate LoRA computation on the CPU and GPU. Moreover, CaraServe employs a rank-aware scheduling algorithm to optimally schedule heterogeneous LoRA requests for maximum service-level objective (SLO) attainment. We have implemented CaraServe and evaluated it against state-of-the-art LoRA serving systems. Our results demonstrate that CaraServe can speed up the average request serving latency by up to 1.4× and achieve an SLO attainment of up to 99%.
DOI: 10.48550/arxiv.2401.11240
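The abstract describes CPU-assisted prefilling: while an activated adapter's weights are still being copied to the GPU, its low-rank contribution for the prompt tokens is computed on the CPU from the host-resident copy and merged with the GPU base model's output. Below is a minimal PyTorch sketch of that idea; the function names, shapes, and overlap strategy are illustrative assumptions, not CaraServe's actual implementation, which relies on a highly optimized CPU/GPU synchronization mechanism.

```python
# Minimal sketch of CPU-assisted LoRA prefill, assuming PyTorch. All names
# here (lora_delta_cpu, prefill_with_cpu_assist) are hypothetical; CaraServe's
# real CPU/GPU coordination uses a custom, highly optimized sync mechanism.
import torch

def lora_delta_cpu(x_cpu: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                   alpha: float, rank: int) -> torch.Tensor:
    # LoRA update (alpha / r) * x A^T B^T, computed on the CPU from the
    # host-resident adapter while the GPU copy is still in flight.
    return (alpha / rank) * (x_cpu @ A.t()) @ B.t()

def prefill_with_cpu_assist(x_gpu, W0_gpu, A_cpu, B_cpu, alpha, rank):
    # Base-model projection runs on the GPU as usual.
    base_out = x_gpu @ W0_gpu.t()
    # Start the host-to-device copy of the adapter; with pinned host memory
    # and non_blocking=True this can overlap with the CPU work below.
    A_gpu = A_cpu.to("cuda", non_blocking=True)
    B_gpu = B_cpu.to("cuda", non_blocking=True)
    # Meanwhile, compute the adapter's contribution for the prompt tokens on
    # the CPU, hiding the cold-start latency of the weight transfer.
    delta = lora_delta_cpu(x_gpu.cpu(), A_cpu, B_cpu, alpha, rank)
    # Merge back into the GPU activation; decoding then uses the GPU copies.
    return base_out + delta.to("cuda"), (A_gpu, B_gpu)
```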
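The abstract also mentions rank-aware scheduling of heterogeneous LoRA requests for SLO attainment. The sketch below shows one plausible reading, assuming a simple cost model in which a batched decoding step grows with the largest adapter rank in the batch; both the cost model and the greedy admission policy are illustrative assumptions, not the paper's algorithm.

```python
# Minimal sketch of rank-aware batching under an assumed latency model:
# the batched LoRA kernel is dominated by the largest rank present (smaller
# adapters are padded to it). Constants and policy are illustrative only.
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    rank: int        # LoRA rank of the adapter this request activates
    slo_ms: float    # per-step latency SLO for this request

def estimate_step_ms(batch: list[Request], base_ms: float = 20.0,
                     ms_per_rank: float = 0.05) -> float:
    # Assumed cost model: base-model time plus a term that grows with the
    # maximum adapter rank in the batch and the batch size.
    if not batch:
        return 0.0
    return base_ms + ms_per_rank * max(r.rank for r in batch) * len(batch)

def schedule(pending: list[Request], max_batch: int) -> list[Request]:
    """Greedily admit requests in ascending rank order while every
    admitted request's SLO is still predicted to be met."""
    batch: list[Request] = []
    for req in sorted(pending, key=lambda r: r.rank):
        trial = batch + [req]
        if len(trial) <= max_batch and all(
                estimate_step_ms(trial) <= r.slo_ms for r in trial):
            batch = trial
    return batch
```

Sorting by rank keeps low-rank requests from being padded up to (and delayed by) a high-rank adapter in the same batch, which is one way heterogeneous ranks can hurt SLO attainment.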