Comparing Evaluation Metrics for Sentence Boundary Detection

Bibliographic Details
Main Authors: Liu, Yang; Shriberg, E.
Format: Conference Paper
Language: English
Description
Summary: In recent NIST evaluations on sentence boundary detection, a single error metric was used to describe performance. Additional metrics, however, are available for such tasks, in which a word stream is partitioned into subunits. This paper compares alternative evaluation metrics, including the NIST error rate, classification error rate per word boundary, precision and recall, ROC curves, DET curves, precision-recall curves, and the area under those curves, and discusses the advantages and disadvantages of each. Unlike many studies in machine learning, we use real data for a real task. We find benefit from using curves in addition to a single metric. Furthermore, we find that data skew affects the metrics, and that differences among system outputs are more visible in precision-recall curves. These results should help clarify evaluation metrics in a way that generalizes to similar language processing tasks.
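
To make the contrast between the metrics concrete, the sketch below (illustrative only, not code from the paper) computes the NIST error rate and the per-boundary classification error rate from their standard definitions, along with precision and recall, on hypothetical skewed data; the boundary_metrics helper, the ref/hyp example arrays, and the use of scikit-learn for the curves are all assumptions for illustration.

    # A minimal sketch, assuming binary per-boundary labels (1 = sentence
    # boundary, 0 = no boundary) and hypothetical example data.
    import numpy as np
    from sklearn.metrics import precision_recall_curve, roc_curve, auc

    def boundary_metrics(ref, hyp):
        """Compute NIST error rate, classification error, precision, recall."""
        ref, hyp = np.asarray(ref), np.asarray(hyp)
        insertions = np.sum((hyp == 1) & (ref == 0))  # false alarms
        deletions = np.sum((hyp == 0) & (ref == 1))   # misses
        # NIST error rate: errors normalized by the number of REFERENCE
        # boundaries, so it can exceed 100% on skewed data.
        nist = (insertions + deletions) / max(np.sum(ref == 1), 1)
        # Classification error rate: errors normalized by ALL word boundaries,
        # so heavy class skew (few true boundaries) deflates it.
        cer = (insertions + deletions) / len(ref)
        tp = np.sum((hyp == 1) & (ref == 1))
        precision = tp / max(tp + insertions, 1)
        recall = tp / max(tp + deletions, 1)
        return nist, cer, precision, recall

    # Hypothetical skewed data: 2 reference boundaries in 10 word boundaries.
    ref = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
    hyp = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    print(boundary_metrics(ref, hyp))  # NIST = 1.0, CER = 0.2, P = 0.5, R = 0.5

    # Curves are drawn from per-boundary posterior scores rather than hard
    # decisions; stand-in random scores are used here for illustration.
    scores = np.random.rand(len(ref))
    prec, rec, _ = precision_recall_curve(ref, scores)
    fpr, tpr, _ = roc_curve(ref, scores)
    print("ROC AUC:", auc(fpr, tpr))

Note how the two single-number error rates diverge on the same output: one insertion plus one deletion gives a 100% NIST error rate (two errors against two reference boundaries) but only a 20% classification error rate (two errors against ten word boundaries), which is one way data skew shapes the reported numbers.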
ISSN: 1520-6149, 2379-190X
DOI: 10.1109/ICASSP.2007.367194