Efficient Inference for Large Language Model-based Generative Recommendation

Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly, particularly due to the excessive inference latency caused by autoregressive decoding. For lossless LLM decoding acceleration, Speculative Decoding (SD) has emerged as a promis...
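To illustrate what "lossless" acceleration means here, the sketch below shows the generic greedy form of speculative decoding, not the method proposed in this paper (whose details are cut off in the abstract above). A cheap draft model proposes a few tokens, and the expensive target model keeps only the longest prefix it agrees with, plus one token of its own. The toy `target_model` and `draft_model` functions are hypothetical stand-ins that return a next token directly instead of logits.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new: int,
    gamma: int = 4,
) -> List[int]:
    """Greedy speculative decoding: the draft proposes up to `gamma` tokens,
    the target verifies them, and the longest agreeing prefix is kept."""
    seq = list(prompt)
    produced = 0
    while produced < max_new:
        # 1) The cheap draft model proposes tokens autoregressively.
        proposal = []
        ctx = list(seq)
        for _ in range(min(gamma, max_new - produced)):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target verifies the proposal. A real system scores all
        #    positions in one batched forward pass; here we query them in turn.
        accepted = []
        ctx = list(seq)
        for tok in proposal:
            expected = target(ctx)
            if tok != expected:
                # First mismatch: keep the target's own token and stop.
                accepted.append(expected)
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Every drafted token matched, so the same verification pass
            # also yields one extra target token.
            accepted.append(target(ctx))
        accepted = accepted[: max_new - produced]
        seq.extend(accepted)
        produced += len(accepted)
    return seq[len(prompt):]


def target_model(prefix: List[int]) -> int:
    # Hypothetical deterministic "target": next token depends on the last one.
    return (prefix[-1] * 3 + 1) % 10


def draft_model(prefix: List[int]) -> int:
    # Hypothetical draft that agrees with the target except after token 7.
    last = prefix[-1]
    return 0 if last == 7 else (last * 3 + 1) % 10


if __name__ == "__main__":
    prompt = [7]
    spec = speculative_decode(target_model, draft_model, prompt, 6)
    base = greedy_decode(target_model, prompt, 6)
    # The speculative output matches plain greedy decoding token for token.
    print(spec == base)
```

Because every emitted token is one the target itself would have chosen greedily, the output is identical to plain decoding; the speed-up depends only on how often the draft agrees with the target.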

Bibliographic details
Main authors: Lin, Xinyu, Yang, Chaoqun, Wang, Wenjie, Li, Yongqi, Du, Cunxiao, Feng, Fuli, Ng, See-Kiong, Chua, Tat-Seng
Format: Article
Language: English