Identifying Parallel Web Documents by Filenames


Bibliographic details
Main authors: Chen, Jisong; Yeh, Chung-Hsing; Chau, Rowena
Format: Book chapter
Language: English
Online access: Full text
Description
Summary: Parallel Web documents are the crucial lexical basis for constructing robust multilingual Web-based linguistic knowledge resources. To identify parallel Web documents efficiently and effectively, this paper develops a new automatic approach based on filenames, exploiting the parallel document naming practice commonly used on the Web. The approach involves three procedures that identify, respectively, the common file descriptor, the language flag, and the language flag-pair among all filenames examined. To determine how these three procedures can be combined for the best result, five methods are developed by incorporating the procedures in different ways. An experimental study on a Hong Kong government Web site evaluates the performance of these five methods in terms of recall and precision. The results show that the method combining file descriptor alignment with language flag-pair alignment outperforms the other methods, with a precision of 95.3% and a recall of 91.0%.
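The summary names the ingredients (a common file descriptor, a language flag, and a language flag-pair) but not the algorithm itself. The following is a minimal sketch of how filename-based pairing and its evaluation might look, not the authors' method. The flag pairs in FLAG_PAIRS, the separator characters, and the function names split_name, find_parallel_pairs, and precision_recall are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical language flag-pairs; the summary does not list the ones used.
FLAG_PAIRS = {
    frozenset(("e", "c")),
    frozenset(("en", "zh")),
    frozenset(("eng", "chi")),
}

def split_name(filename):
    """Split 'report_e.htm' into ('report', 'e').

    Assumes the flag is a 1-3 letter suffix after '_' or '-'.
    Returns None when the filename has no such suffix.
    """
    stem = filename.rsplit(".", 1)[0].lower()
    match = re.match(r"^(.+)[_-]([a-z]{1,3})$", stem)
    return (match.group(1), match.group(2)) if match else None

def find_parallel_pairs(filenames):
    """Pair files that share a descriptor and whose flags form a known pair."""
    groups = defaultdict(dict)  # descriptor -> {flag: filename}
    for name in filenames:
        parsed = split_name(name)
        if parsed:
            descriptor, flag = parsed
            groups[descriptor][flag] = name
    pairs = set()
    for flagged in groups.values():
        for a, b in combinations(sorted(flagged), 2):
            if frozenset((a, b)) in FLAG_PAIRS:
                pairs.add((flagged[a], flagged[b]))
    return pairs

def precision_recall(found, truth):
    """Precision = correct / found; recall = correct / truth."""
    correct = len(found & truth)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

files = ["report_e.htm", "report_c.htm", "index.htm",
         "news-en.html", "news-zh.html", "press_old.htm"]
found = find_parallel_pairs(files)
```

In this sketch a suffix such as "old" is parsed as a flag but never paired, because it belongs to no entry in FLAG_PAIRS. That is one way a flag-pair check could filter out accidental matches on the descriptor alone.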
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-3-540-24655-8_14