GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition
Main authors: | , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | While Connectionist Temporal Classification (CTC) models deliver
state-of-the-art accuracy in automated speech recognition (ASR) pipelines,
their performance has been limited by CPU-based beam search decoding. We
introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam search
decoder compatible with current CTC models. It increases pipeline throughput
and decreases latency, supports streaming inference, and also supports advanced
features like utterance-specific word boosting via on-the-fly composition. We
provide pre-built DLPack-based Python bindings for ease of use with
Python-based machine learning frameworks at
https://github.com/nvidia-riva/riva-asrlib-decoder. We evaluated our decoder
for offline and online scenarios, demonstrating that it is the fastest beam
search decoder for CTC models. In the offline scenario, it achieves up to 7
times more throughput than the current state-of-the-art CPU decoder, and in the
online streaming scenario, it achieves nearly 8 times lower latency, with the same
or better word error rate. |
---|---|
DOI: | 10.48550/arxiv.2311.04996 |