InstaTrans: An Instruction-Aware Translation Framework for Non-English Instruction Datasets
Main authors: | , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | It is challenging to generate high-quality instruction datasets for
non-English languages due to tail phenomena, which limit performance on less
frequently observed data. To mitigate this issue, we propose translating
existing high-quality English instruction datasets as a solution, emphasizing
the need for complete and instruction-aware translations to maintain the
inherent attributes of these datasets. We claim that fine-tuning LLMs with
datasets translated in this way can improve their performance in the target
language. To this end, we introduce a new translation framework tailored for
instruction datasets, named InstaTrans (INSTruction-Aware TRANSlation). Through
extensive experiments, we demonstrate the superiority of InstaTrans over other
competitors in terms of completeness and instruction-awareness of translation,
highlighting its potential to broaden the accessibility of LLMs across diverse
languages at a relatively low cost. Furthermore, we have validated that
fine-tuning LLMs with datasets translated by InstaTrans can effectively improve
their performance in the target language. |
DOI: | 10.48550/arxiv.2410.01512 |