Hierarchical and parallel processing of auditory and modulation frequencies for automatic speech recognition
Published in: | Speech Communication 2010-10, Vol.52 (10), p.790-800 |
---|---|
Main author: | |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | This paper investigates, from an automatic speech recognition perspective, the most effective way of combining Multi-Layer Perceptron (MLP) classifiers trained on different ranges of auditory and modulation frequencies. Two MLP-based combination schemes are considered. The first operates in parallel and is invariant to the order in which feature streams are introduced. The second operates hierarchically and is sensitive to the order in which feature streams are introduced. The study is carried out on a Large Vocabulary Continuous Speech Recognition (LVCSR) system for transcription of meeting data using the TANDEM approach. Results reveal that (1) the combination of MLPs trained on different ranges of auditory frequencies is more effective when performed in parallel; (2) the combination of MLPs trained on different ranges of modulation frequencies is more effective when performed hierarchically, moving from high to low modulations; (3) the improvement obtained from separate processing of two modulation frequency ranges (12% relative WER reduction with respect to the single-classifier approach) is considerably larger than the improvement obtained from separate processing of two auditory frequency ranges (4% relative WER reduction with respect to the single-classifier approach). Similar results are also verified on other LVCSR systems and on other languages. Furthermore, the paper extends the discussion to the combination of classifiers trained on separate auditory–modulation frequency channels, showing that the previous conclusions also hold in this scenario. |
---|---|
ISSN: | 0167-6393, 1872-7182 |
DOI: | 10.1016/j.specom.2010.05.007 |
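
The abstract reports its gains as relative WER (word error rate) reductions, which are easy to misread as absolute percentage points. The short sketch below shows the arithmetic only. The baseline WER of 40% is a hypothetical value chosen for illustration and does not come from the paper, and the function names are likewise illustrative.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Drop in WER expressed as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def wer_after_relative_reduction(baseline_wer: float, relative_reduction: float) -> float:
    """WER that results when the baseline is lowered by a relative amount."""
    return baseline_wer * (1.0 - relative_reduction)


# Hypothetical baseline WER in percent (assumption, not a figure from the paper).
baseline = 40.0

# Relative reductions quoted in the abstract: 12% (modulation), 4% (auditory).
modulation_wer = wer_after_relative_reduction(baseline, 0.12)
auditory_wer = wer_after_relative_reduction(baseline, 0.04)

print(f"modulation split: {modulation_wer:.1f}% WER")
print(f"auditory split:   {auditory_wer:.1f}% WER")
```

Under this assumed baseline, a 12% relative reduction corresponds to 4.8 absolute points, while a 4% relative reduction corresponds to 1.6 absolute points.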