Building a large annotated corpus of English: the Penn Treebank
Published in: | Computational linguistics - Association for Computational Linguistics 1993-06, Vol.19 (2), p.313-330 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | The process of building the Penn Treebank corpus of American English is reviewed. During the first three years, the corpus, consisting of more than 4.5 million words, was tagged for part of speech with an estimated 3% error rate. Over 50% of the corpus has also been bracketed to reveal a skeletal syntactic structure. Both processes are semiautomated. Tagging is first completed automatically, then corrected by human annotators. This process is shown to be superior to either total automation or manual tagging alone in three ways: speed, consistency, and accuracy. Each tagging process is described in detail. The form and availability of the corpus for members of the Linguistic Data Consortium are discussed, as well as several research projects currently using the corpus. 4 Tables, 5 Figures, 18 References. M. Lemons |
---|---|
ISSN: | 0891-2017, 1530-9312 |