Streaming Target-Speaker ASR with Neural Transducer
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Although recent advances in deep learning technology have boosted automatic
speech recognition (ASR) performance in the single-talker case, it remains
difficult to recognize multi-talker speech in which many voices overlap. One
conventional approach to tackle this problem is to use a cascade of a speech
separation or target speech extraction front-end with an ASR back-end. However,
the extra computation costs of the front-end module are a critical barrier to
quick response, especially for streaming ASR. In this paper, we propose a
target-speaker ASR (TS-ASR) system that implicitly integrates the target speech
extraction functionality within a streaming end-to-end (E2E) ASR system, i.e.,
a recurrent neural network-transducer (RNNT). Our system uses an idea similar to that
adopted for target speech extraction, but implements it directly at the level
of the encoder of RNNT. This allows TS-ASR to be realized without placing extra
computation costs on the front-end. Note that this study differs from prior
studies on E2E TS-ASR in two major respects: we investigate streaming
models and base our study on Conformer models, whereas prior studies used
RNN-based systems and considered only offline processing. We confirm in
experiments that our TS-ASR achieves recognition performance comparable to
conventional cascade systems in the offline setting, while reducing computation
costs and realizing streaming TS-ASR. |
---|---|
DOI: | 10.48550/arxiv.2209.04175 |