Segmenting Long Sentence Pairs for Statistical Machine Translation
In phrase-based statistical machine translation, the knowledge about phrase translation and phrase reordering is learned from the bilingual corpora. However, words may be poorly aligned in long sentence pairs in practice, which will then do harm to the following steps of the translation, such as phr...
Gespeichert in:
Hauptverfasser: | , , , |
---|---|
Format: | Tagungsbericht |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | In phrase-based statistical machine translation, the knowledge about phrase translation and phrase reordering is learned from the bilingual corpora. However, words may be poorly aligned in long sentence pairs in practice, which will then do harm to the following steps of the translation, such as phrase extraction, etc. A possible solution to this problem is segmenting long sentence pairs into shorter ones. In this paper, we present an effective approach to segmenting sentences based on the modified IBM translation model 1. We find that by taking into account the semantics of some words, as well as the length ratio of source and target sentences, the segmentation result is largely improved. We also discuss the effect of length factor to the segmentation result. Experiments show that our approach can improve the BLEU score of a phrase-based translation system by about 0.5 points. |
---|---|
DOI: | 10.1109/IALP.2009.20 |