Convolutional Maximum-Likelihood Distortionless Response Beamforming With Steering Vector Estimation for Robust Speech Recognition
Beamforming has been one of the most successful approaches using multi-microphones for robust speech recognition. Although a beamforming method, called the "maximum-likelihood distortionless response (MLDR)" beamformer, was recently presented to achieve promising performance, it requires a...
Gespeichert in:
Veröffentlicht in: | IEEE/ACM transactions on audio, speech, and language processing speech, and language processing, 2021, Vol.29, p.1352-1367 |
---|---|
Hauptverfasser: | , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Beamforming has been one of the most successful approaches using multi-microphones for robust speech recognition. Although a beamforming method, called the "maximum-likelihood distortionless response (MLDR)" beamformer, was recently presented to achieve promising performance, it requires an accurate steering vector for a target speaker in advance like many kinds of beamformers. In this paper, we present a method for steering vector estimation (SVE) by replacing the noise spatial covariance matrix estimate with a normalized version of the variance-weighted spatial covariance matrix estimate for the observed noisy speech signal obtained by the iterative update rule in the MLDR beamforming framework. In addition, an MLDR beamforming method without a steering vector for a target speaker given in advance is presented where the SVE and the beamforming are alternately repeated. Furthermore, an online algorithm based on recursive least squares (RLS) is derived to cope with various practical applications including time-varying situations, and the power method is introduced for further efficient online processing. We also present batch and online convolutional MLDR beamforming with SVE for simultaneous beamforming and dereverberation where the weighted prediction error (WPE) dereverberation and the MLDR beamforming with the SVE were jointly optimized based on the maximum-likelihood estimation (MLE) for a zero-mean complex Gaussian signal with time-varying variances. Moreover, input signals masked by a neural network (NN) for estimating target speech or noise components can be used to further improve the presented beamformers. Experimental results on the CHiME-4 and REVERB challenge datasets demonstrate the effectiveness of the presented methods. |
---|---|
ISSN: | 2329-9290 2329-9304 |
DOI: | 10.1109/TASLP.2021.3067202 |