Large Linguistic Corpus Reduction with SCP Algorithms


Bibliographic Details
Published in: Computational Linguistics - Association for Computational Linguistics, 2015-09, Vol. 41 (3)
Main Authors: Barbot, Nelly; Boëffard, Olivier; Chevelu, Jonathan; Delhay, Arnaud
Format: Article
Language: English
Online Access: Full text
Description
Abstract: For practical reasons (mainly related to costs), and to meet the quality expectations of the associated application, corpus design is a crucial issue for building rich annotated text corpora. Reducing a large corpus while maintaining sufficient linguistic richness can be formalized as a Set Covering Problem (SCP). Within this context, we present in this paper two algorithmic heuristics applied to design large text corpora in English and French covering multi-represented phonological units. The first algorithm considered is a standard greedy solution with an agglomerative/splitting strategy. We propose a second algorithm based on Lagrangian relaxation. This approach provides a lower bound on the cost of each covering solution. This lower bound can be used as a metric to evaluate the quality of a reduced corpus, whatever algorithm is applied. Experiments show that a suboptimal algorithm such as the greedy one achieves good results: the cost of its solutions is not far from the lower bound (about 4.35% for the triphoneme coverings). Constraints on SCP are usually binary; here we propose a generalization in which the constraint on each covering feature can be multi-valued.
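
The kind of greedy multi-valued covering heuristic referred to in the abstract can be sketched as follows. This is a minimal, generic illustration under assumed inputs, not the authors' implementation: the function name greedy_multicover, the sentence-to-feature mapping, and the per-feature occurrence targets are all hypothetical, and the agglomerative/splitting refinement and the Lagrangian-relaxation lower bound described in the paper are not reproduced here.

from collections import Counter

def greedy_multicover(sentences, required):
    """Greedy heuristic for a multi-valued Set Covering Problem:
    select sentences until every feature (e.g. a triphoneme) occurs
    at least required[feature] times in the reduced corpus.

    sentences: dict mapping sentence id -> Counter of its features.
    required:  dict mapping feature -> number of occurrences needed.
    Returns the list of selected sentence ids.
    """
    remaining = Counter(required)      # occurrences still needed per feature
    unused = dict(sentences)           # candidates not yet selected
    selected = []

    def gain(features):
        # How much still-uncovered demand this sentence would satisfy.
        return sum(min(n, remaining[f]) for f, n in features.items()
                   if remaining.get(f, 0) > 0)

    while any(count > 0 for count in remaining.values()):
        best_id = max(unused, key=lambda sid: gain(unused[sid]), default=None)
        if best_id is None or gain(unused[best_id]) == 0:
            break                      # no candidate can cover what remains
        selected.append(best_id)
        for f, n in unused.pop(best_id).items():
            if f in remaining:
                remaining[f] = max(0, remaining[f] - n)

    return selected

For example, calling greedy_multicover with each sentence represented by a Counter of its triphonemes and with required set to 1 for every triphoneme reproduces the standard binary SCP; raising the targets above 1 yields the multi-represented (multi-valued) variant discussed in the abstract.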
ISSN: 0891-2017; 1530-9312
DOI: 10.1162/COLI_a_00225