Spatial Processing Front-End For Distant ASR Exploiting Self-Attention Channel Combinator
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | We present a novel multi-channel front-end based on channel shortening with
the Weighted Prediction Error (WPE) method followed by a fixed MVDR beamformer,
used in combination with a recently proposed self-attention-based channel
combination (SACC) scheme, for tackling the distant ASR problem. We show that
the proposed system, used as part of a ContextNet-based end-to-end (E2E) ASR
system, outperforms leading ASR systems, as demonstrated by a 21.6% relative
reduction in WER on a multi-channel LibriSpeech playback dataset. We also show that
dereverberation prior to beamforming is beneficial and compare the WPE method
with a modified neural channel shortening approach. An analysis of the
non-intrusive estimate of the signal C50 confirms that the 8-channel WPE method
provides significant dereverberation of the signals (13.6 dB improvement). We
also show how the weights of the SACC system allow the extraction of accurate
spatial information, which can be beneficial for other speech processing
applications such as diarization. |
---|---|
DOI: | 10.48550/arxiv.2203.13919 |
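
The summary reports its headline result as a relative WER reduction. As a minimal sketch of how such a figure is usually computed, the snippet below derives a relative reduction from a baseline WER and a system WER. The function name and the example WER values are hypothetical and are not taken from the paper, which does not state absolute WERs in this record.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates on the same scale
    (for example, both in percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper).
example = relative_wer_reduction(baseline_wer=10.0, system_wer=8.0)
print(f"{example:.1f}% relative WER reduction")
```

A relative figure depends on the baseline: the same absolute improvement of 2 points is a larger relative reduction when the baseline WER is lower.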