Classification Based on Speech Rhythm via a Temporal Alignment of Spoken Sentences
Published in: | IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015-12, Vol.23 (12), p.2209-2216 |
---|---|
Main authors: | , |
Format: | Article |
Language: | English |
Subjects: | |
Abstract: | How much information is contained in the rhythm of speech? Is it possible to tell, just from the rhythm of the speech, whether the speaker is male or female? Is it possible to tell if they are a native or nonnative speaker? This paper provides a new way to address such questions. Traditional investigations into speech rhythm approach the problem by manually annotating the speech, and investigating a preselected collection of features such as the durations of vowels or inter-phoneme timings. This paper presents a method that can automatically align the audio of multiple people when speaking the same sentence. The output of the alignment procedure is a mapping (from the micro-timing of one speaker to that of another) that can be used as a surrogate for speech rhythm. The method is applied to a large online corpus of speakers and shows that it is possible to classify the speakers based on these mappings alone. Several technical aspects are discussed. First, the spectrograms switch between different-length analysis windows (based on whether the speech is voiced or unvoiced) to ameliorate the time-frequency trade-off. These variable window spectrograms are fed into a dynamic time warping algorithm to produce a timing map which represents the speech rhythm. The accuracy of the alignment is evaluated by a technique of transitive validation, and the timing maps are used to form a feature vector for the classification. The method is applied to the online Speech Accent Archive corpus. In the gender discrimination experiments, the proposed method was only about 5% worse than a state-of-the-art classifier based on spectral feature vectors. In the native speaker discrimination task, the speech rhythm was about 15% better than when using spectral information. |
---|---|
ISSN: | 2329-9290 2329-9304 |
DOI: | 10.1109/TASLP.2015.2475155 |
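
The abstract describes feeding spectrograms into a dynamic time warping (DTW) algorithm to obtain a timing map between two speakers. The following is a minimal sketch of textbook DTW, not the authors' implementation: the function name `dtw_timing_map`, the Euclidean frame distance, and the three-way step pattern are assumptions chosen for illustration, and the variable-window spectrogram step is replaced here by arbitrary feature frames.

```python
import numpy as np

def dtw_timing_map(a, b):
    """Align two feature sequences with classic dynamic time warping.

    a: array of shape (n, d), frames of speaker A (hypothetical input)
    b: array of shape (m, d), frames of speaker B (hypothetical input)
    Returns a list of (i, j) frame-index pairs, from (0, 0) to (n-1, m-1),
    which can be read as a timing map from A's frames to B's frames.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # Pairwise Euclidean distance between every frame of a and every frame of b
    # (assumed local cost; the paper may use a different one).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # cost[i, j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],  # advance in both sequences
                cost[i - 1, j],      # advance in a only
                cost[i, j - 1],      # advance in b only
            )

    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda s: cost[s])
        path.append((i - 1, j - 1))
    path.reverse()
    return path

# Toy usage: speaker B lingers on the first frame, so frame 0 of A maps to
# frames 0 and 1 of B.
a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0], [3.0]])
timing_map = dtw_timing_map(a, b)
```

In this toy case the map pairs the repeated opening frame of `b` with the single opening frame of `a`, which is the kind of local stretching that a rhythm-based feature vector could be built from.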