Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Conformer has recently achieved state-of-the-art performance in many speech
recognition tasks. However, Transformer-based models degrade significantly on
long-form speech, such as lectures, because the self-attention mechanism
becomes unreliable and requires computation on the order of the square of the
input length. To solve this problem, we incorporate a state-space model,
Hungry Hungry Hippos (H3), to replace or complement multi-head self-attention
(MHSA). H3 allows efficient modeling of long-form sequences with linear-order
computation. In experiments on two datasets, CSJ and LibriSpeech, our proposed
H3-Conformer model performs efficient and robust recognition of long-form
speech. Moreover, we propose a hybrid of H3 and MHSA and show that using H3 in
higher layers and MHSA in lower layers provides a significant improvement in
online recognition. We also investigate the parallel use of H3 and MHSA in all
layers, which results in the best performance. |
---|---|
DOI: | 10.48550/arxiv.2410.04159 |
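
The layer arrangements named in the summary (H3 replacing MHSA, H3 in higher layers with MHSA in lower layers, and both in parallel in every layer) can be pictured with a small configuration helper. This is only an illustrative sketch: the function name `layer_layout`, the mode strings, the layer count, and the split point are hypothetical and are not taken from the paper, which this record does not describe in that detail.

```python
from typing import List


def layer_layout(num_layers: int, mode: str, num_mhsa_lower: int = 0) -> List[str]:
    """Return the sequence-mixing module per encoder layer, lowest layer first.

    All names here are hypothetical labels for illustration only.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be positive")
    if mode == "mhsa":        # baseline: self-attention in every layer
        return ["mhsa"] * num_layers
    if mode == "h3":          # H3 replaces self-attention in every layer
        return ["h3"] * num_layers
    if mode == "hybrid":      # MHSA in the lower layers, H3 in the higher layers
        if not 0 <= num_mhsa_lower <= num_layers:
            raise ValueError("num_mhsa_lower must lie in [0, num_layers]")
        return ["mhsa"] * num_mhsa_lower + ["h3"] * (num_layers - num_mhsa_lower)
    if mode == "parallel":    # H3 and MHSA side by side in every layer
        return ["h3+mhsa"] * num_layers
    raise ValueError(f"unknown mode: {mode}")


# Example with an assumed 4-layer encoder and an assumed split after layer 2.
print(layer_layout(4, "hybrid", num_mhsa_lower=2))
# → ['mhsa', 'mhsa', 'h3', 'h3']
```

The helper only labels layers; how each module is implemented, and where the split is best placed, is left to the full text.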