Identifying Parallel Web Documents by Filenames

Bibliographic details
Main authors:	Chen, Jisong; Yeh, Chung-Hsing; Chau, Rowena
Format:	Book chapter
Language:	English
Subjects:	Applied sciences | Candidate Pair | Computer science; control theory; systems | Computer systems and distributed systems. User interface | Exact sciences and technology | Information systems. Data bases | Memory organisation. Data processing | Pair Similarity | Parallel Corpus | Recall Rate | Software | Suffix Array
Online access:	Full text

Description
Summary:	Parallel Web documents are a crucial lexical basis for constructing robust multilingual Web-based linguistic knowledge resources. To identify parallel Web documents efficiently and effectively, this paper develops a new automatic approach based on filenames that exploits the parallel document naming practices commonly used on the Web. The approach involves three procedures, which identify the common file descriptor, the language flag, and the language flag-pair, respectively, among all filenames examined. To determine how these three procedures can best be combined, five methods are developed by incorporating the procedures in different ways. An experimental study on a Hong Kong government Web site evaluates the performance of the five methods in terms of recall and precision. The results show that the method combining file descriptor alignment with language flag-pair alignment outperforms the other methods, with a precision of 95.3% and a recall of 91.0%.
ISSN:	0302-9743, 1611-3349
DOI:	10.1007/978-3-540-24655-8_14
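
The general idea in the summary, pairing files whose names share a common descriptor and differ only in a language flag, can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not describe the three procedures in detail or list the flags used in the experiment, so the flag-pair ("_en", "_zh"), the suffix position of the flag, and the function names below are all assumptions made for illustration.

```python
import os
from collections import defaultdict

# Hypothetical language flag-pair; the flags used in the paper are not given here.
FLAG_PAIR = ("_en", "_zh")


def split_flag(filename, flags=FLAG_PAIR):
    """Return (descriptor, flag) if the file stem ends with a known flag, else None."""
    stem, ext = os.path.splitext(filename)
    for flag in flags:
        # Require a non-empty descriptor in front of the flag.
        if stem.endswith(flag) and len(stem) > len(flag):
            return stem[: -len(flag)] + ext, flag
    return None


def candidate_pairs(filenames, flags=FLAG_PAIR):
    """Group files by shared descriptor and pair those that carry both flags."""
    groups = defaultdict(dict)
    for name in filenames:
        parsed = split_flag(name, flags)
        if parsed is not None:
            descriptor, flag = parsed
            groups[descriptor][flag] = name
    first, second = flags
    return sorted(
        (members[first], members[second])
        for members in groups.values()
        if first in members and second in members
    )


if __name__ == "__main__":
    files = ["news_en.htm", "news_zh.htm", "index.htm", "about_en.htm"]
    print(candidate_pairs(files))
```

In this sketch the flag-pair is supplied by hand, whereas the summary indicates that the paper's procedures identify descriptors, flags, and flag-pairs from the filenames themselves. A file such as about_en.htm, which has no counterpart carrying the other flag, yields no candidate pair.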