Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Summary: | As part of the Open Language Data Initiative shared tasks, we have expanded
the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely
spoken in Mozambique. We translated the dev and devtest sets from Portuguese
into Emakhuwa, and we detail the translation process and quality assurance
measures used. Our methodology involved various quality checks, including
post-editing and adequacy assessments. The resulting datasets consist of
multiple reference sentences for each source. We present baseline results from
training a Neural Machine Translation system and fine-tuning existing
multilingual translation models. Our findings suggest that spelling
inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline
models underperformed on this evaluation set, underscoring the necessity for
further research to enhance machine translation quality for Emakhuwa. The data
is publicly available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES. |
DOI: | 10.48550/arxiv.2408.11457 |