Large Linguistic Corpus Reduction with SCP Algorithms
Published in: Computational Linguistics (Association for Computational Linguistics), 2015-09, Vol. 41 (3)
Format: Article
Language: English
Online Access: Full text
Abstract: For practical reasons (mainly related to costs), and to meet the quality expectations of the associated application, corpus design is a crucial issue when building rich annotated text corpora. Reducing a large corpus while maintaining sufficient linguistic richness can be formalized as a Set Covering Problem (SCP). Within this context, we present in this paper two algorithmic heuristics applied to the design of large text corpora in English and French, covering multi-represented phonological units. The first algorithm considered is a standard greedy solution with an agglomerative/splitting strategy. We propose a second algorithm based on Lagrangian relaxation, which provides a lower bound on the cost of each covering solution. This lower bound can be used as a metric to evaluate the quality of a reduced corpus, whatever algorithm is applied. Experiments show that a suboptimal algorithm such as the greedy one achieves good results; the cost of its solutions is not far from the lower bound (about 4.35% for the triphoneme coverings). Constraints in SCP are usually binary; we propose here a generalization where the constraint on each covering feature can be multi-valued.
ISSN: 0891-2017, 1530-9312
DOI: 10.1162/COLI_a_00225
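The abstract above mentions three ideas: a greedy covering heuristic, a Lagrangian-relaxation lower bound on the covering cost, and a multi-valued generalization of the usual binary SCP constraints. The sketch below illustrates all three on a toy instance. It is not the authors' implementation: the formulation follows the standard SCP/Lagrangian textbook setup, and every unit name, feature name, cost, and demand value is hypothetical.

```python
"""Minimal sketch of multi-valued set covering for corpus reduction.

Formulation assumed here (standard SCP with integer coverage demands):
  minimise   sum_j cost_j * x_j            (x_j in {0,1}: keep sentence j or not)
  subject to sum_j count_j[f] * x_j >= demand[f]   for every feature f.

With demand[f] = 1 for all f this is the usual binary SCP; the multi-valued
generalization simply allows demand[f] > 1 (e.g. "each triphone must occur
at least k times in the reduced corpus").
"""
from collections import Counter

# Toy, hypothetical instance: each unit is a sentence with a cost (e.g. length)
# and a Counter of phonological features (e.g. triphones) with occurrence counts.
units = {
    "s1": (3, Counter({"a-b-c": 2, "b-c-d": 1})),
    "s2": (2, Counter({"b-c-d": 1, "c-d-e": 1})),
    "s3": (4, Counter({"a-b-c": 1, "c-d-e": 2, "d-e-f": 1})),
    "s4": (1, Counter({"d-e-f": 1})),
}
demand = Counter({"a-b-c": 2, "b-c-d": 2, "c-d-e": 2, "d-e-f": 1})


def greedy_cover(units, demand):
    """Greedy heuristic: repeatedly take the unit with the best ratio of
    newly satisfied demand to cost until every demand is met."""
    remaining = Counter(demand)
    available = dict(units)
    chosen, total_cost = [], 0

    def gain(item):
        _, (cost, cov) = item
        new = sum(min(cov[f], remaining[f]) for f in remaining)
        return new / cost if new else 0.0

    while any(v > 0 for v in remaining.values()):
        name, (cost, cov) = max(available.items(), key=gain)
        if gain((name, (cost, cov))) == 0:
            raise ValueError("demand cannot be satisfied by the remaining units")
        chosen.append(name)
        total_cost += cost
        for f in cov:
            remaining[f] = max(0, remaining[f] - cov[f])
        del available[name]
    return chosen, total_cost


def lagrangian_lower_bound(units, demand, iters=200, step=1.0):
    """Subgradient ascent on the Lagrangian dual of the multi-valued SCP.
    For any multipliers u >= 0,
        L(u) = sum_f u[f]*demand[f] + sum_j min(0, cost_j - sum_f u[f]*cov_j[f])
    is a valid lower bound on the optimal covering cost."""
    u = {f: 0.0 for f in demand}
    best = float("-inf")
    for _ in range(iters):
        bound = sum(u[f] * demand[f] for f in demand)
        x = {}
        for name, (cost, cov) in units.items():
            reduced_cost = cost - sum(u[f] * cov.get(f, 0) for f in demand)
            x[name] = 1 if reduced_cost < 0 else 0   # relaxed optimal choice
            bound += min(0.0, reduced_cost)
        best = max(best, bound)
        # Subgradient = unmet demand under the current relaxed solution.
        g = {f: demand[f] - sum(cov.get(f, 0) * x[n]
                                for n, (_, cov) in units.items())
             for f in demand}
        norm = sum(v * v for v in g.values())
        if norm == 0:
            break
        for f in demand:
            u[f] = max(0.0, u[f] + step * g[f] / norm)
    return best


if __name__ == "__main__":
    solution, cost = greedy_cover(units, demand)
    lb = lagrangian_lower_bound(units, demand)
    print(f"greedy solution {solution}, cost {cost}, Lagrangian lower bound {lb:.2f}")
```

The relative gap between the greedy cost and the Lagrangian lower bound, (cost - L(u)) / L(u), is the kind of algorithm-independent quality metric the abstract refers to; the roughly 4.35% figure quoted there is the paper's measured gap for triphoneme coverings, not something this toy example reproduces.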