Unsupervised spoken term discovery using pseudo lexical induction
Published in: | International journal of speech technology 2023-09, Vol.26 (3), p.801-816 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Full text |
Abstract: | Unsupervised spoken term discovery aims to capture pattern similarities among spoken terms in the absence of annotation. Such an approach is useful for untranscribed spoken content from low-resource or zero-resource languages. A challenge in the discovery task is computing the similarities among spoken terms without annotation. Dynamic time warping (DTW) is one technique that computes a temporal alignment between two acoustic feature representations of the speech signal without annotation. However, the variability that arises in natural speech poses a challenge to the DTW approach and degrades the performance of the spoken term discovery task. In this study, we overcome these challenges and improve the performance of the discovery task in three stages. First, a speaker-independent acoustic feature representation is obtained from a Self Organising Map (SOM) to reduce the variability. Second, non-segmental pseudo-labels are generated for the spoken content using a context-free grammar. Finally, spoken term similarities are obtained by grouping similar sequences using the proposed Label Sequence Similarity Mapping and Language modelling algorithms. The performance of the proposed system was measured on the Zero-Speech challenge corpus in terms of matching, clustering and parsing quality. The experimental results show that the proposed approach improves performance by 34.2% and 22.4% in English and Xitsonga, respectively, across multiple speakers. In addition, the word-level clustering performance of the spoken terms improved by 4.2% in English. |
---|---|
ISSN: | 1381-2416 1572-8110 |
DOI: | 10.1007/s10772-023-10049-6 |
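
The abstract names dynamic time warping (DTW) as the baseline technique for aligning two acoustic feature sequences without annotation. For readers unfamiliar with it, the following is a minimal sketch of the textbook DTW recurrence, not the authors' implementation. The function name `dtw_distance`, the Euclidean frame distance, and the toy one-dimensional "frames" are assumptions made for illustration only; real systems would operate on acoustic features such as MFCC vectors.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors.

    `a` and `b` are lists of equal-dimension frames (lists of floats).
    Hypothetical helper for illustration; Euclidean frame distance is assumed.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local frame-to-frame distance
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy example: the second sequence is a time-stretched copy of the first,
# so the warping path can absorb the difference in duration.
slow = [[0.0], [1.0], [1.0], [2.0]]
fast = [[0.0], [1.0], [2.0]]
stretched_cost = dtw_distance(fast, slow)
```

Because DTW only compensates for differences in timing, two utterances of the same term by different speakers can still yield a large cost when their frame-level features differ, which is the kind of variability the abstract says the proposed approach is meant to reduce.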