Hierarchical and parallel processing of auditory and modulation frequencies for automatic speech recognition

Bibliographic details
Published in: Speech Communication 2010-10, Vol. 52 (10), p. 790-800
Main author: Valente, Fabio
Format: Article
Language: English
Description
Abstract: This paper investigates, from an automatic speech recognition perspective, the most effective way of combining Multi-Layer Perceptron (MLP) classifiers trained on different ranges of auditory and modulation frequencies. Two MLP-based combination schemes are considered. The first operates in a parallel fashion and is invariant to the order in which the feature streams are introduced. The second operates in a hierarchical fashion and is sensitive to the order in which the feature streams are introduced. The study is carried out on a Large Vocabulary Continuous Speech Recognition (LVCSR) system for the transcription of meeting data using the TANDEM approach. Results reveal that (1) the combination of MLPs trained on different ranges of auditory frequencies is more effective when performed in a parallel fashion; (2) the combination of MLPs trained on different ranges of modulation frequencies is more effective when performed in a hierarchical fashion, moving from high to low modulation frequencies; and (3) the improvement obtained from separate processing of two modulation frequency ranges (a 12% relative WER reduction with respect to the single-classifier approach) is considerably larger than the improvement obtained from separate processing of two auditory frequency ranges (a 4% relative WER reduction with respect to the single-classifier approach). Similar results are also verified on other LVCSR systems and on other languages. Furthermore, the paper extends the discussion to the combination of classifiers trained on separate auditory–modulation frequency channels, showing that the previous conclusions also hold in this scenario.
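The two combination schemes described in the abstract can be illustrated with a short sketch. The Python snippet below is a minimal, hedged illustration using scikit-learn MLPs as stand-ins for the paper's phoneme-posterior classifiers; the feature streams, dimensions, and the averaging rule used for the parallel merge are illustrative assumptions, since the abstract does not specify the exact merging recipe.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_frames = 500                                # toy number of speech frames
low_mod = rng.normal(size=(n_frames, 20))     # hypothetical low-modulation-frequency features
high_mod = rng.normal(size=(n_frames, 20))    # hypothetical high-modulation-frequency features
phones = rng.integers(0, 5, size=n_frames)    # toy phoneme labels

# Parallel combination: each MLP sees one stream, and the frame-level
# posteriors are merged by a symmetric rule (plain averaging here), so the
# result does not depend on the order in which the streams are introduced.
mlp_low = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300).fit(low_mod, phones)
mlp_high = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300).fit(high_mod, phones)
parallel_post = 0.5 * (mlp_low.predict_proba(low_mod) + mlp_high.predict_proba(high_mod))

# Hierarchical combination, moving from high to low modulations: the second
# MLP is trained on the first MLP's posteriors concatenated with the second
# stream, so the model changes if the streams are swapped.
stage1 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300).fit(high_mod, phones)
stage2_input = np.hstack([stage1.predict_proba(high_mod), low_mod])
stage2 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300).fit(stage2_input, phones)
hierarchical_post = stage2.predict_proba(stage2_input)

Note the asymmetry: swapping low_mod and high_mod leaves the parallel posteriors unchanged up to training noise, but yields a different hierarchical model, which is exactly the order sensitivity the abstract reports.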
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2010.05.007