Optimized Speculative Sampling for GPU Hardware Accelerators
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | In this work, we optimize speculative sampling for parallel hardware
accelerators to improve sampling speed. We notice that substantial portions of
the intermediate matrices necessary for speculative sampling can be computed
concurrently. This allows us to distribute the workload across multiple GPU
threads, enabling simultaneous operations on matrix segments within thread
blocks. This results in profiling time improvements ranging from 6% to 13%
relative to the baseline implementation, without compromising accuracy. To
further accelerate speculative sampling, probability distributions
parameterized by softmax are approximated by sigmoid. This approximation
approach results in significantly greater relative improvements in profiling
time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct
extensive experiments on both automatic speech recognition and summarization
tasks to validate the effectiveness of our optimization methods. |
---|---|
DOI: | 10.48550/arxiv.2406.11016 |
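
The summary mentions two ingredients: the accept/reject arithmetic of speculative sampling, and replacing softmax with an element-wise sigmoid. The following is a minimal CPU-side sketch of both in plain Python, assuming the standard speculative sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q). All function names are hypothetical, and this is not the GPU kernel from the article. It only illustrates why the sigmoid variant needs no reduction over the vocabulary, which is what makes it easier to parallelize.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; needs a max and a sum over all logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_scores(logits):
    """Element-wise sigmoid (assumed stand-in for softmax).

    Each output depends only on its own logit, so no reduction is required.
    The outputs do not sum to 1 in general.
    """
    out = []
    for z in logits:
        if z >= 0:
            out.append(1.0 / (1.0 + math.exp(-z)))
        else:
            ez = math.exp(z)  # avoids overflow for very negative z
            out.append(ez / (1.0 + ez))
    return out


def acceptance_probability(p, q, token):
    """Probability of accepting a drafted token: min(1, p[token] / q[token])."""
    return min(1.0, p[token] / q[token])


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p never exceeds q, the residual is all zeros; fall back to p.
    return [d / total for d in diff] if total > 0.0 else list(p)


def speculative_step(p, q, token, rng=random):
    """Return (token, accepted): keep the draft token or resample it."""
    if rng.random() < acceptance_probability(p, q, token):
        return token, True
    residual = residual_distribution(p, q)
    new_token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return new_token, False


# Target (p) and draft (q) distributions over a toy 3-token vocabulary.
p = softmax([2.0, 1.0, 0.0])
q = softmax([0.0, 1.0, 2.0])
```

Here `p` plays the role of the target model's distribution and `q` the draft model's. In this sketch the element-wise terms (the ratio p/q and the clipped difference p - q) are independent per vocabulary entry, which is the kind of work the summary describes distributing across GPU threads.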