Identifying Parallel Web Documents by Filenames
Main authors: | , , |
---|---|
Format: | Book chapter |
Language: | English |
Subjects: | |
Online access: | Full text |
Summary: | Parallel Web documents are a crucial lexical basis for constructing robust multilingual Web-based linguistic knowledge resources. To identify parallel Web documents efficiently and effectively, this paper develops a new automatic approach based on filenames, exploiting the naming conventions commonly used for parallel documents on the Web. The approach involves three procedures that identify, respectively, the common file descriptor, the language flag, and the language flag-pair among all filenames examined. To determine how these three procedures can best be used, five methods are developed by combining the procedures in different ways. An experimental study on a Hong Kong government Web site evaluates the performance of these five methods in terms of recall and precision. The results show that the method combining file descriptor alignment with language flag-pair alignment outperforms the other methods, with a precision of 95.3% and a recall of 91.0%. |
---|---|
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/978-3-540-24655-8_14 |
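
The summary describes pairing documents by a shared file descriptor plus a matching language flag-pair. The following is only a minimal sketch of that general idea, not the paper's procedure: the flag-pairs in `FLAG_PAIRS`, the helper names `split_name` and `find_pairs`, and the sample filenames are all hypothetical, and it assumes the language flag appears at the end of the filename stem.

```python
import os
from collections import defaultdict

# Hypothetical language flag-pairs, chosen for illustration only.
FLAG_PAIRS = {("e", "c"), ("eng", "chi"), ("en", "zh")}
FLAGS = {flag for pair in FLAG_PAIRS for flag in pair}


def split_name(filename):
    """Return (descriptor, flag), or None if no known flag ends the stem."""
    stem, ext = os.path.splitext(filename.lower())
    # Try longer flags first so that a longer flag is preferred over a shorter one.
    for flag in sorted(FLAGS, key=len, reverse=True):
        if stem.endswith(flag) and len(stem) > len(flag):
            # Drop the flag and any separator that preceded it.
            descriptor = stem[: -len(flag)].rstrip("_-.")
            if descriptor:
                return descriptor + ext, flag
    return None


def find_pairs(filenames):
    """Pair files that share a descriptor and whose flags form a known flag-pair."""
    groups = defaultdict(dict)  # descriptor -> {flag: original filename}
    for name in filenames:
        parsed = split_name(name)
        if parsed:
            descriptor, flag = parsed
            groups[descriptor][flag] = name
    pairs = []
    for flags in groups.values():
        for first, second in FLAG_PAIRS:
            if first in flags and second in flags:
                pairs.append((flags[first], flags[second]))
    return sorted(pairs)


files = [
    "press_e.htm", "press_c.htm",
    "about_eng.html", "about_chi.html",
    "index.htm", "notice.htm",
]
pairs = find_pairs(files)
```

In this sketch `notice.htm` is tentatively split into a descriptor and the flag `e`, but it is not paired because no file with the same descriptor carries the partner flag `c`. That is one way to see why requiring a flag-pair, rather than a single flag, can reduce false matches.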