Mining Parallel Data from Comparable Corpora via Triangulation

Bibliographic Details
Main Authors:	Thi-Ngoc-Diep Do, Castelli, E., Besacier, L.
Format:	Conference proceeding
Language:	English
Subjects:	comparable corpus; Computational linguistics; Data mining; extracting parallel sentence pairs; Information filters; Noise measurement; Training; triangulation method; unsupervised method

Description
Summary:	This paper improves an unsupervised method for extracting parallel sentence pairs from a comparable corpus by using triangulation through a third language. The unsupervised method was proposed previously; it is based on cross-language information retrieval with an iterative process and requires no additional parallel data. It has been validated on Vietnamese-French and Vietnamese-English bilingual data. In this paper, we address the problem of using triangulation through a third language to improve the parallel data mining processes: English is used in the Vietnamese-French parallel data mining process, and French is used in the Vietnamese-English parallel data mining process. The experiments show that triangulation can improve both the quality of the extracted data and the quality of the translation system.
DOI:	10.1109/IALP.2011.57