Fusion of parametric and non-parametric approaches to noise-robust ASR

Detailed description

Bibliographic details
Published in:	Speech Communication 2014-01, Vol.56 (Jan), p.49-62
Main authors:	Sun, Yang; Gemmeke, Jort F.; Cranen, Bert; Bosch, Louis ten; Boves, Lou
Format:	Article
Language:	English
Subjects:	Decoding; Dynamic Bayesian Network; Dynamics; Early fusion; Estimates; Gaussian; Mathematical models; Robust speech recognition; Sparse Classification; Streams; Tasks; Test sets; Virtual Evidence
Online access:	Full text

Description
Abstract:	Highlights:
• The posteriors produced by a GMM and a non-parametric system called Sparse Classification are combined in this work.
• These two posteriors differ in the distribution of their probability vectors and in the corresponding recognition performance.
• The combination is conducted in a Dynamic Bayesian Network via Virtual Evidence, allowing joint training of the two streams.
• This paper shows that, besides a weighting approach, reducing the support of SC also leads to an overall good combination.

In this paper we present a principled method for the fusion of independent estimates of the state likelihood in a Dynamic Bayesian Network (DBN) by means of the Virtual Evidence option, with the aim of improving speech recognition in the Aurora-2 task. A first estimate is derived from a conventional parametric Gaussian Mixture Model (GMM); a second estimate is obtained from a non-parametric Sparse Classification (SC) system. During training, the parameters pertaining to the input streams can be optimized independently, but also jointly, provided that all streams represent true probability functions. During decoding, the weights of the streams can be varied much more freely. The state likelihoods in the GMM and SC streams turned out to be very different, which makes it necessary to apply different weights to the streams in decoding. When using optimal weights, the dual-input system can outperform both the individual GMM system and the individual SC system at all SNR levels in test sets A and B of the Aurora-2 task.
ISSN:	0167-6393 1872-7182
DOI:	10.1016/j.specom.2013.07.003
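
Illustrative sketch (not part of the catalog record or the paper): the abstract describes combining GMM and SC state-likelihood streams with separate decoding weights. One common way to picture such a weighted combination is a log-linear product of the two streams, renormalized per frame. The Python below is a minimal sketch under that assumption only; it does not reproduce the authors' DBN or Virtual Evidence setup. The function name `fuse_streams`, the parameters `w_gmm`, `w_sc`, and the `floor` value are all hypothetical.

```python
import numpy as np

def fuse_streams(p_gmm, p_sc, w_gmm=1.0, w_sc=1.0, floor=1e-10):
    """Log-linear fusion of two (frames x states) score matrices.

    Assumption: each stream holds non-negative per-frame state scores.
    The floor keeps exact zeros in a sparse stream from producing -inf.
    """
    p_gmm = np.asarray(p_gmm, dtype=float)
    p_sc = np.asarray(p_sc, dtype=float)
    # Weighted sum in the log domain equals a weighted product of the streams.
    log_fused = (w_gmm * np.log(np.maximum(p_gmm, floor))
                 + w_sc * np.log(np.maximum(p_sc, floor)))
    # Subtract the per-frame maximum before exponentiating for numerical stability.
    log_fused -= log_fused.max(axis=1, keepdims=True)
    fused = np.exp(log_fused)
    # Renormalize so every frame sums to one.
    return fused / fused.sum(axis=1, keepdims=True)

# Toy input: one frame, three states; the second stream is sparse.
gmm = [[0.6, 0.3, 0.1]]
sc = [[0.0, 1.0, 0.0]]
fused = fuse_streams(gmm, sc, w_gmm=1.0, w_sc=1.0)
```

Setting `w_sc=0.0` in this sketch reduces the output to the renormalized first stream, which gives a simple way to check how much the second stream changes the per-frame decision.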