LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings
Main authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Summary: | Sentence embedding models play a key role in various Natural Language
Processing tasks, such as Topic Modeling, Document Clustering and
Recommendation Systems. However, these models rely heavily on parallel data,
which can be scarce for many low-resource languages, including Luxembourgish.
This scarcity results in suboptimal performance of monolingual and
cross-lingual sentence embedding models for these languages. To address this
issue, we compile a relatively small but high-quality human-generated
cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence
embedding model for Luxembourgish with strong cross-lingual capabilities.
Additionally, we present evidence suggesting that including low-resource
languages in parallel training datasets can be more advantageous for other
low-resource languages than relying solely on high-resource language pairs.
Furthermore, recognizing the lack of sentence embedding benchmarks for
low-resource languages, we create a paraphrase detection benchmark specifically
for Luxembourgish, aiming to partially fill this gap and promote further
research. |
---|---|
DOI: | 10.48550/arxiv.2412.03331 |