Building a large annotated corpus of English: the Penn Treebank

Bibliographic Details
Published in:	Computational linguistics - Association for Computational Linguistics 1993-06, Vol.19 (2), p.313-330
Main authors:	MARCUS, M. P.; SANTORINI, B.; MARCINKIEWICZ, M. A.
Format:	Article
Language:	eng
Subjects:	Applied linguistics; Computational linguistics; Linguistics
Online access:	Full text

Description
Summary:	The process of building the Penn Treebank corpus of American English is reviewed. During the first three years, the corpus, consisting of more than 4.5 million words, was tagged for part of speech with an estimated 3% error rate. Over 50% of the corpus has also been bracketed to reveal a skeletal syntactic structure. Both processes are semiautomated: tagging is first completed automatically and then corrected by human annotators. This process is shown to be superior to either total automation or manual tagging alone in three ways: speed, consistency, and accuracy. Each tagging process is described in detail. The form and availability of the corpus for members of the Linguistic Data Consortium are discussed, as well as several research projects currently using the corpus. 4 Tables, 5 Figures, 18 References. M. Lemons
ISSN:	0891-2017, 1530-9312