Does mBERT understand Romansh? Evaluating word embeddings using word alignment
In Proceedings of the 8th edition of the Swiss Text Analytics Conference, 2023, pages 41-53, Neuchatel, Switzerland. Association for Computational Linguistics.
Main Author: | |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | In Proceedings of the 8th edition of the Swiss Text Analytics
Conference, 2023, pages 41-53, Neuchatel, Switzerland. Association for
Computational Linguistics. We test similarity-based word alignment models (SimAlign and awesome-align)
in combination with word embeddings from mBERT and XLM-R on parallel sentences
in German and Romansh. Since Romansh is an unseen language, we are dealing with
a zero-shot setting. Using embeddings from mBERT, both models reach an
alignment error rate of 0.22, which outperforms fast_align, a statistical
model, and is on par with similarity-based word alignment for seen languages.
We interpret these results as evidence that mBERT contains information that can
be meaningful and applicable to Romansh.
To evaluate performance, we also present a new trilingual corpus, which we
call the DERMIT (DE-RM-IT) corpus, containing press releases made by the Canton
of Grisons in German, Romansh and Italian in the past 25 years. The corpus
contains 4 547 parallel documents and approximately 100 000 sentence pairs in
each language combination. We additionally present a gold standard for
German-Romansh word alignment. The data is available at
https://github.com/eyldlv/DERMIT-Corpus. |
---|---|
DOI: | 10.48550/arxiv.2306.08702 |
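
The summary reports results as an alignment error rate (AER). As a rough illustration of how such a score is usually computed, here is a minimal sketch assuming the conventional definition over "sure" and "possible" gold links, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The function name and the toy links are hypothetical and are not taken from the paper or the DERMIT repository, and the record does not say whether the gold standard distinguishes sure from possible links.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Return AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted: iterable of (source_index, target_index) links from a model (A)
    sure:      iterable of gold links annotated as certain (S)
    possible:  optional iterable of gold links annotated as acceptable (P);
               the sure links are always counted as possible as well
    """
    a = set(predicted)
    s = set(sure)
    p = s | set(possible or ())
    if not a and not s:
        # Nothing predicted and nothing expected: treat as error-free.
        return 0.0
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Hypothetical toy sentence pair, links given as (German index, Romansh index).
gold_sure = {(0, 0), (1, 1), (2, 2)}
gold_possible = {(3, 2)}
predicted = {(0, 0), (1, 1), (2, 3), (3, 2)}

aer = alignment_error_rate(predicted, gold_sure, gold_possible)
print(f"AER = {aer:.3f}")
```

Lower is better: a prediction identical to the sure links scores 0.0, while links outside the possible set raise the score, so a reported 0.22 means roughly a fifth of the link mass is in error under this definition.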