4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | The network architecture of end-to-end (E2E) automatic speech recognition
(ASR) can be classified into several models, including connectionist temporal
classification (CTC), recurrent neural network transducer (RNN-T), attention
mechanism, and non-autoregressive mask-predict models. Since each of these
network architectures has pros and cons, a typical practice is to switch between
these separate models depending on the application requirements, which
increases the overhead of maintaining all models. Several methods for integrating
two of these complementary models to mitigate the overhead issue have been
proposed; however, if we integrate more models, we will further benefit from
these complementary models and realize broader applications with a single
system. This paper proposes four-decoder joint modeling (4D) of CTC, attention,
RNN-T, and mask-predict, which has the following three advantages: 1) The four
decoders are jointly trained so that they can be easily switched depending on
the application scenarios. 2) Joint training may bring model regularization and
improve the model robustness thanks to their complementary properties. 3) Novel
one-pass joint decoding methods using CTC, attention, and RNN-T further
improve the performance. The experimental results showed that the proposed
model consistently reduced the word error rate (WER). |
---|---|
DOI: | 10.48550/arxiv.2212.10818 |
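
The abstract says the four decoders are jointly trained but does not give the objective. A minimal sketch of one common way to do this, assuming a weighted sum of the per-decoder losses, is shown below. The helper `joint_loss`, the dictionary keys, the equal weights, and the loss values are all hypothetical and are not taken from the paper.

```python
from typing import Dict


def joint_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-decoder losses into one training objective.

    Assumption: the objective is a plain weighted sum. The abstract only
    states that the decoders are trained jointly, not how they are weighted.
    """
    if set(losses) != set(weights):
        raise ValueError("losses and weights must cover the same decoders")
    return sum(weights[name] * losses[name] for name in losses)


# Illustrative numbers only; they do not come from the paper.
example_losses = {"ctc": 2.0, "attention": 1.0, "rnnt": 3.0, "mask_predict": 4.0}
example_weights = {"ctc": 0.25, "attention": 0.25, "rnnt": 0.25, "mask_predict": 0.25}

total = joint_loss(example_losses, example_weights)
print(total)
```

In a real system each entry in `example_losses` would be a differentiable tensor produced by the corresponding decoder on top of a shared encoder, so a single backward pass would update all four decoders at once.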