Spoken word recognition using a novel speech boundary segment of voiceless articulatory consonants
Communication through speech offers the most straightforward channel for man-machine interaction. Nevertheless, it is a barrier for some languages with low data resources. Extracting features and processing silence in a speech signal is an unnecessary extra effort. Noise in the speech signal reduces...
Gespeichert in:
Veröffentlicht in: | International journal of information technology (Singapore. Online) 2024-04, Vol.16 (4), p.2661-2673 |
---|---|
Hauptverfasser: | , , , , , , |
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Communication through speech offers the most straightforward channel for man-machine interaction. Nevertheless, it is a barrier for some languages with low data resources. Extracting features and processing silence in a speech signal is an unnecessary extra effort. Noise in the speech signal reduces classification accuracy. Therefore, silence and noise are removed from the signal to improve recognition. Nonetheless, current approaches rely on static Zero-Crossing-Rate (ZCR) and energy values for the detection. Through the analysis of the speech signal, it has been determined that the utilization of fixed ZCR and energy values do not effectively address the delineation of unvoiced consonant boundaries in speech. The use of static values fails to accurately identify the speech boundary during the articulation of these unvoiced consonants. Therefore, in this study, the dynamic value of ZCR and energy has been derived to overcome this problem. Here, roughly a spoken region has first been identified from each speech signal of a non-overlapping frame. In the second step, the dynamic values are derived by two novel algorithms. Two standard datasets, the Free Spoken Digit Dataset (FSDD) and the Bangla 0 to 99 Dataset (Bangla Dataset), spoken words in English and Bengali, respectively, have been used in this study. The Mel Frequency Cepstral Coefficients (MFCC) have been extracted from each raw signal and the proposed pre-processed signal. Subsequently, these features are input into a Bidirectional Long-Short-Term-Memory (BiLSTM) network. The result shows the superiority of the proposed pre-processing methods. |
---|---|
ISSN: | 2511-2104 2511-2112 |
DOI: | 10.1007/s41870-024-01776-3 |