L1-aware Multilingual Mispronunciation Detection Framework
Saved in:
Main authors:
Format: Article
Language: English
Keywords:
Online access: Order full text
Abstract: The phonological discrepancies between a speaker's native language (L1) and the non-native language (L2) serve as a major factor in mispronunciation. This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation. An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence. First, an attention mechanism is deployed to align the input audio with the reference phoneme sequence. Afterwards, L1-L2 speech embeddings are extracted from an auxiliary model, pretrained in a multi-task setup to identify the L1 and L2 languages, and infused into the primary network. Finally, L1-MultiMDD is optimized for a unified multilingual phoneme recognition task using connectionist temporal classification (CTC) loss for the target languages: English, Arabic, and Mandarin. Our experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen (L2-ARTIC, LATIC, and AraVoiceL2v2) and unseen (EpaDB and Speechocean762) datasets. The consistent gains in phoneme error rate (PER) and false rejection rate (FRR) across all target languages confirm our approach's robustness, efficacy, and generalizability.
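The abstract's final training objective is a CTC loss over phoneme sequences. As a minimal sketch of what CTC computes (the forward algorithm over a blank-interleaved label sequence), not the authors' implementation, the negative log-likelihood can be written in plain Python:

```python
import math

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss via the forward algorithm.
    log_probs: list of T frames, each a list of V per-symbol log-probabilities.
    target: list of label indices (e.g. phoneme IDs), without blanks.
    Returns -log P(target | log_probs)."""
    # Extended label sequence: a blank before, between, and after each label.
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), len(log_probs)
    NEG_INF = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # alpha[s]: log-probability of all alignments of the first t frames
    # that end at position s of the extended sequence.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]          # stay on the same symbol
            if s > 0:
                cands.append(alpha[s - 1])  # advance by one
            # Skip a blank, unless that would merge two identical labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new
    # Valid alignments end on the last label or the trailing blank.
    return -logsumexp(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

For example, with two frames of uniform log-probabilities over {blank, phoneme 1}, three alignments ("11", "-1", "1-") collapse to the target [1], so the loss is -log(3/4). In practice a batched, differentiable implementation such as `torch.nn.CTCLoss` would be used instead.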
DOI: 10.48550/arxiv.2309.07719