Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO

Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling to reduce the overall latency of text generation and compare it with standard autoregre...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	arXiv.org 2024-04
Hauptverfasser:	Barad, Haim, Aidova, Ekaterina, Gorbachev, Yury
Format:	Artikel
Sprache:	eng
Schlagworte:	Generative artificial intelligence Power consumption Sampling methods User experience
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling to reduce the overall latency of text generation and compare it with standard autoregressive sampling. This can be used together with model-based optimizations (e.g. quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.
ISSN:	2331-8422