Efficient LLM Scheduling by Learning to Rank
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | In Large Language Model (LLM) inference, the output length of an LLM request
is typically regarded as not known a priori. Consequently, most LLM serving
systems employ a simple First-come-first-serve (FCFS) scheduling strategy,
leading to Head-Of-Line (HOL) blocking and reduced throughput and service
quality. In this paper, we reexamine this assumption -- we show that, although
predicting the exact generation length of each request is infeasible, it is
possible to predict the relative ranks of output lengths in a batch of
requests, using learning to rank. The ranking information offers valuable
guidance for scheduling requests. Building on this insight, we develop a novel
scheduler for LLM inference and serving that can approximate the
shortest-job-first (SJF) schedule better than existing approaches. We integrate
this scheduler with the state-of-the-art LLM serving system and show
significant performance improvement in several important applications: 2.8x
lower latency in chatbot serving and 6.5x higher throughput in synthetic data
generation. Our code is available at https://github.com/hao-ai-lab/vllm-ltr.git |
---|---|
DOI: | 10.48550/arxiv.2408.15792 |
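
The summary's central claim, that relative ranks of output lengths are enough to approximate a shortest-job-first (SJF) schedule, can be illustrated with a toy model. The sketch below is not the paper's scheduler: it assumes a single server that runs one request at a time (real serving systems batch requests), and the request lengths, ranker scores, and function names are all hypothetical. It compares mean completion time under FCFS, under an order derived from imperfect rank scores, and under exact SJF.

```python
from typing import List, Sequence


def mean_completion_time(lengths: Sequence[int], order: Sequence[int]) -> float:
    """Average completion time when jobs run one at a time in the given order."""
    clock = 0
    total = 0
    for idx in order:
        clock += lengths[idx]  # job idx finishes at this time
        total += clock
    return total / len(order)


def fcfs_order(n: int) -> List[int]:
    """First-come-first-serve: run requests in arrival order."""
    return list(range(n))


def ranked_order(scores: Sequence[float]) -> List[int]:
    """Run requests in ascending order of a predicted-length score.

    Only the relative order of the scores matters, not their values.
    sorted() is stable, so ties fall back to arrival order.
    """
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Hypothetical true output lengths in tokens (unknown to the scheduler).
true_lengths = [900, 20, 350, 40, 10]

# Hypothetical ranker scores; note the ranker wrongly places
# request 3 (40 tokens) ahead of request 1 (20 tokens).
scores = [0.95, 0.30, 0.60, 0.20, 0.10]

fcfs = mean_completion_time(true_lengths, fcfs_order(len(true_lengths)))
ranked = mean_completion_time(true_lengths, ranked_order(scores))
sjf = mean_completion_time(true_lengths, ranked_order(true_lengths))

print(f"FCFS:   {fcfs:.1f}")
print(f"Ranked: {ranked:.1f}")
print(f"SJF:    {sjf:.1f}")
```

With these made-up numbers the long first request blocks everything behind it under FCFS (mean completion time 1144.0), while the rank-based order reaches 374.0 despite one misordered pair, close to the exact SJF value of 370.0.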