Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Previous approaches to accent conversion (AC) mainly aimed at making
non-native speech sound more native while maintaining the original content and
speaker identity. However, non-native speakers sometimes have pronunciation
issues, which can make it difficult for listeners to understand them. Hence, we
developed a new AC approach that not only focuses on accent conversion but also
improves the pronunciation of non-native accented speakers. Given the
non-native audio and the corresponding transcript, we generate ideal
ground-truth audio with native-like pronunciation that preserves the original
duration and prosody. This ground-truth data aids the model in learning a direct
mapping between accented and native speech. We utilize the end-to-end VITS
framework to achieve high-quality waveform reconstruction for the AC task. As a
result, our system not only produces audio that closely resembles native accents
while retaining the original speaker's identity but also improves pronunciation,
as demonstrated by evaluation results. |
---|---|
DOI: | 10.48550/arxiv.2410.14997 |