Segmenting Long Sentence Pairs for Statistical Machine Translation

In phrase-based statistical machine translation, the knowledge about phrase translation and phrase reordering is learned from the bilingual corpora. However, words may be poorly aligned in long sentence pairs in practice, which will then do harm to the following steps of the translation, such as phr...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Biping Meng, Shujian Huang, Xinyu Dai, Jiajun Chen
Format:	Tagungsbericht
Sprache:	eng
Schlagworte:	Computer science Costs length normalization long sentences Particle separators Poisson distributed length ratio segmentation semantics guided
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	In phrase-based statistical machine translation, the knowledge about phrase translation and phrase reordering is learned from the bilingual corpora. However, words may be poorly aligned in long sentence pairs in practice, which will then do harm to the following steps of the translation, such as phrase extraction, etc. A possible solution to this problem is segmenting long sentence pairs into shorter ones. In this paper, we present an effective approach to segmenting sentences based on the modified IBM translation model 1. We find that by taking into account the semantics of some words, as well as the length ratio of source and target sentences, the segmentation result is largely improved. We also discuss the effect of length factor to the segmentation result. Experiments show that our approach can improve the BLEU score of a phrase-based translation system by about 0.5 points.
DOI:	10.1109/IALP.2009.20