Vocal Tract Length Estimation Using Accumulated Means of Formants and Its Effects on Speaker-Normalization


Bibliographic details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, Vol. 29, p. 1049-1064
Main authors: Sakata, Tadashi; Ikeda, Naomitsu; Ueda, Yuichi; Watanabe, Akira
Format: Article
Language: English
Description
Summary: Differences in vocal tract lengths (VTLs) among individual speakers cause variations in the acoustic features of phonemes. In this paper, a simple method is proposed to estimate speaker-specific VTLs and to quantitatively evaluate some speaker-normalization effects of the VTLs. We employed accumulated means of formant trajectories to estimate the VTLs of speakers ranging from children to adults. For the formant estimation, the inverse-filter control (IFC) system was used. In this system, the choice of analysis order, that is, the number of formants to be estimated, is automated. Moreover, to evaluate the speaker-normalization effect of VTLs, we proposed a data reduction method that finds elliptical dense areas of the distributions in the formant space. Using these ellipse areas, we evaluated three normalization effects of VTLs: normalization by the mean of all VTLs as the standard, by speaker-categorical means of VTLs, and by individual VTLs. Relative to the standard area of the original data, the area was reduced by 39.5% with the categorical means and by 46.6% with individual VTLs. As a result, our proposed method was used to provide a "normalized vowel map (NVM)" that visualizes universal vowel distributions as a core image of linguistic information. Finally, using the proposed methods, we compared the estimated VTLs with those obtained by another method based on magnetic resonance imaging (MRI) data.
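The summary does not state the formula the authors use to turn accumulated formant means into a VTL. As a rough illustration of the general idea only, the sketch below assumes the textbook uniform-tube (quarter-wavelength) model, in which a tube closed at the glottis and open at the lips has resonances F_n = (2n - 1)c / (4L). The function names, the speed-of-sound constant, and the sample formant values are hypothetical and are not taken from the paper.

```python
from statistics import mean

# Assumed speed of sound in the vocal tract (cm/s); not a value from the paper.
SPEED_OF_SOUND_CM_S = 35000.0


def vtl_from_mean_formants(mean_formants_hz):
    """Estimate VTL in cm from long-term mean formants (F1, F2, ...).

    Uniform-tube assumption: F_n = (2n - 1) * c / (4 * L), so each
    formant yields L = (2n - 1) * c / (4 * F_n). The per-formant
    estimates are averaged.
    """
    estimates = [
        (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * f)
        for n, f in enumerate(mean_formants_hz, start=1)
    ]
    return mean(estimates)


def normalize_formants(formants_hz, speaker_vtl_cm, reference_vtl_cm):
    """Rescale formants to a reference VTL.

    Formant frequencies are inversely proportional to tract length in
    this model, so a shorter tract (higher formants) is scaled down.
    """
    scale = speaker_vtl_cm / reference_vtl_cm
    return [f * scale for f in formants_hz]


# Hypothetical accumulated means for one speaker (Hz).
speaker_vtl = vtl_from_mean_formants([500.0, 1500.0, 2500.0])
normalized = normalize_formants([1000.0], 14.0, 17.5)
```

Passing the mean of all VTLs, a speaker-category mean, or the individual VTL as `speaker_vtl_cm` would correspond loosely to the three normalization conditions the summary compares, under the same simplifying assumption.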
ISSN: 2329-9290, 2329-9304
DOI:10.1109/TASLP.2021.3060172