Application of the multilingual acoustic representation model XLSR for the transcription of Ewondo

Detailed description

Bibliographic details
Published in:	ARIMA 2024-10, Vol.42 - Special issue CRI...
Main authors:	Yannick Yomie Nzeuhang, Paulin Melatagia Yonta, Benjamin Lecouteux
Format:	Article
Language:	eng
Subjects:	Artificial Intelligence; Computer Science
Online access:	Full text

Description
Abstract:	Recently popularized self-supervised models offer a possible solution to the problem of low data availability through parsimonious transfer learning. We investigate the effectiveness of two such multilingual acoustic models, wav2vec 2.0 XLSR-53 and wav2vec 2.0 XLSR-128, for transcribing Ewondo, a language spoken in Cameroon. The experiments were conducted on 11 minutes of speech built from 103 read sentences. Although multilingual acoustic models generalize well, preliminary results show that the distance between Ewondo and the languages covered by XLSR (English, French, Spanish, German, Mandarin, ...) strongly affects the performance of the transcription model. The best results obtained are a word error rate (WER) of around 69% and a character error rate (CER) of 28.1%. We analyze and interpret these preliminary results in order to propose effective avenues for improvement.
ISSN:	1638-5713
DOI:	10.46298/arima.13621
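
Note on the reported metrics: the abstract gives results as WER and CER. As a minimal sketch of how these two error rates are conventionally computed (edit distance divided by reference length, over words and over characters respectively), the following may help when reading the figures. The function names and the placeholder strings are illustrative assumptions, not taken from the article, and the authors' actual scoring tool and text normalization are not stated in this record.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.

    Assumption: spaces are counted as characters; some toolkits strip them.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Placeholder strings only; these are not Ewondo data from the article.
    print(wer("a b c d", "a x c"))  # one substitution + one deletion over 4 words
    print(cer("abcd", "abxd"))      # one substitution over 4 characters
```

Because WER penalizes a whole word for any single wrong character, a system can show a much higher WER than CER on the same output, which is consistent with the gap between the two figures quoted in the abstract.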