Resource-Efficient Fine-Tuning Strategies for Automatic MOS Prediction in Text-to-Speech for Low-Resource Languages
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | We train a MOS prediction model based on wav2vec 2.0 using the open-access
data sets BVCC and SOMOS. Our test with neural TTS data in the low-resource
language (LRL) West Frisian shows that pre-training on BVCC before fine-tuning
on SOMOS leads to the best accuracy for both fine-tuned and zero-shot
prediction. Further fine-tuning experiments show that using more than 30
percent of the total data does not lead to significant improvements. In
addition, fine-tuning with data from a single listener shows promising
system-level accuracy, supporting the viability of one-participant pilot tests.
These findings can all assist the resource-conscious development of TTS for
LRLs by progressing towards better zero-shot MOS prediction and informing the
design of listening tests, especially in early-stage evaluation. |
DOI: | 10.48550/arxiv.2305.19396 |