Reference Free Domain Adaptation for Translation of Noisy Questions with Question Specific Rewards
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | English |
Summary: | Community Question-Answering (CQA) portals are a valuable tool for
helping users within an organization. However, making them accessible to
non-English-speaking users remains a challenge. Translating questions
can broaden the community's reach, benefiting individuals with similar
inquiries in other languages. Translating questions with Neural Machine
Translation (NMT) poses additional challenges, especially in noisy environments
where the grammatical correctness of the questions is not monitored. Such
questions may be phrased as statements by non-native speakers, with incorrect
subject-verb order and sometimes even missing question marks. Creating a
synthetic parallel corpus from such data is also difficult because of its noisy
nature. To address this issue, we propose a training methodology that
fine-tunes the NMT system using only source-side data. Our approach balances
adequacy and fluency through a loss function that combines BERTScore and
Masked Language Model (MLM) Score. Our method surpasses the conventional
Maximum Likelihood Estimation (MLE) based fine-tuning approach, which relies on
synthetic target data, by 1.9 BLEU points. Our model remains robust when we add
noise to our baseline, still achieving a 1.1 BLEU improvement and large
improvements on the TER and BLEURT metrics. The proposed methodology is
model-agnostic and is needed only during the training phase.
We make the code and datasets publicly available at
https://www.iitp.ac.in/~ai-nlp-ml/resources.html#DomainAdapt to
facilitate further research. |
---|---|
DOI: | 10.48550/arxiv.2310.15259 |
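
The summary describes a training signal that balances adequacy (BERTScore) against fluency (MLM Score) without target-side references. The sketch below illustrates one way such a combination could work. It is not the authors' formulation: the linear weighting with `alpha`, the mean-reward baseline, and the reward-weighted log-likelihood loss are all assumptions made for illustration, and the score values are toy numbers standing in for real metric outputs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateScores:
    """Scores for one sampled translation (hypothetical container, toy values)."""
    adequacy: float  # stand-in for a BERTScore-style similarity, assumed in [0, 1]
    fluency: float   # stand-in for a normalized MLM-based fluency score, assumed in [0, 1]
    log_prob: float  # log-probability the NMT model assigned to this sample


def combined_reward(scores: CandidateScores, alpha: float = 0.5) -> float:
    """Linear mix of adequacy and fluency; the weighting scheme is an assumption."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * scores.adequacy + (1.0 - alpha) * scores.fluency


def reward_weighted_loss(candidates: List[CandidateScores], alpha: float = 0.5) -> float:
    """Reward-weighted negative log-likelihood with a mean-reward baseline.

    Samples scoring above the batch average get their log-probability pushed
    up when this loss is minimized; samples below the average get pushed down.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    rewards = [combined_reward(c, alpha) for c in candidates]
    baseline = sum(rewards) / len(rewards)
    weighted = [(r - baseline) * c.log_prob for r, c in zip(rewards, candidates)]
    return -sum(weighted) / len(candidates)


if __name__ == "__main__":
    samples = [
        CandidateScores(adequacy=0.9, fluency=0.7, log_prob=-1.0),
        CandidateScores(adequacy=0.5, fluency=0.3, log_prob=-2.0),
    ]
    print(reward_weighted_loss(samples, alpha=0.5))
```

In a real training loop the loss would be computed on differentiable log-probabilities from the NMT model, and `alpha` would be tuned to trade adequacy against fluency.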