TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT5), 2022 We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently-introduced text-to-text Transformer AraT5 model, endowing it wit...
Main authors: | , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Proceedings of the 5th Workshop on Open-Source Arabic Corpora and
Processing Tools (OSACT5), 2022. We present TURJUMAN, a neural toolkit for translating from 20 languages into
Modern Standard Arabic (MSA). TURJUMAN exploits the recently-introduced
text-to-text Transformer AraT5 model, endowing it with a powerful ability to
decode into Arabic. The toolkit offers the possibility of employing a number of
diverse decoding methods, making it suited for acquiring paraphrases for the
MSA translations as an added value. To train TURJUMAN, we sample from publicly
available parallel data employing a simple semantic similarity method to ensure
data quality. This allows us to prepare and release AraOPUS-20, a new machine
translation benchmark. We publicly release our translation toolkit (TURJUMAN)
as well as our benchmark dataset (AraOPUS-20). |
---|---|
DOI: | 10.48550/arxiv.2206.03933 |